EP1188112A2 - Digital signal processor computation core - Google Patents

Digital signal processor computation core

Info

Publication number
EP1188112A2
EP1188112A2 EP00930720A EP00930720A EP1188112A2 EP 1188112 A2 EP1188112 A2 EP 1188112A2 EP 00930720 A EP00930720 A EP 00930720A EP 00930720 A EP00930720 A EP 00930720A EP 1188112 A2 EP1188112 A2 EP 1188112A2
Authority
EP
European Patent Office
Prior art keywords
computation
digital signal
operand
register
signal processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP00930720A
Other languages
German (de)
French (fr)
Inventor
William Caroll Anderson
John Edmondson
Jose Fridman
Marc Hoffman
Russel L. Rivin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Analog Devices Inc
Original Assignee
Analog Devices Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Analog Devices Inc
Priority to EP10184733A (EP2267896A3)
Priority to EP10183715.1A (EP2267596B1)
Priority to EP10184831A (EP2267597A3)
Publication of EP1188112A2
Legal status: Withdrawn

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145 Instruction analysis, e.g. decoding, instruction word fields
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 Arithmetic instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145 Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/30149 Instruction analysis, e.g. decoding, instruction word fields of variable length instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867 Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • G06F9/3873 Variable length pipelines, e.g. elastic pipeline
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867 Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • G06F9/3875 Pipelining a single stage, e.g. superpipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3893 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator

Definitions

  • This invention relates to digital signal processors and, more particularly, to digital signal processor computation core architectures that facilitate complex digital signal processing computations.
  • a digital signal computer or digital signal processor (DSP) is a special purpose computer that is designed to optimize performance for digital signal processing applications, such as, for example, Fast Fourier transforms, digital filters, image processing and speech recognition.
  • Digital signal processor applications are typically characterized by real-time operation, high interrupt rates and intensive numeric computations.
  • digital signal processor applications tend to be intensive in memory access operations and to require the input and output of large quantities of data.
  • Digital signal processor architectures are typically optimized for performing such computations efficiently.
  • Microcontrollers involve the handling of data but typically do not require extensive computation.
  • Microcontroller application programs tend to be longer than DSP programs.
  • In order to limit the memory requirements of microcontroller application programs, it is desirable to provide a high degree of code density in such programs.
  • architectures that are optimized for DSP computations typically do not operate efficiently as microcontrollers.
  • microcontrollers typically do not perform well as digital signal processors. Nonetheless, a particular application may require both digital signal processor and microcontroller functionality.
  • Digital signal processor designs may be optimized with respect to different operating parameters, such as computation speed and power consumption, depending on intended applications.
  • digital signal processors may be designed for 16-bit words, 32-bit words, or other word sizes.
  • a 32-bit architecture that achieves very high operating speed is disclosed in U.S. Patent No. 5,954,811 issued September 21, 1999 to Garde.
  • Digital signal processors frequently utilize architectures wherein two or more data words are stored in each row of memory, and two or more data words are provided in parallel to the computation unit. Such architectures provide enhanced performance, because several instructions and/or operands may be accessed simultaneously.
  • a computation unit is provided.
  • the computation unit is preferably configured for performing digital signal processor computations.
  • the computation unit comprises an execution unit for performing an operation on a first operand and a second operand in response to an instruction, a register file for storing operands, first and second operand buses coupled to the register file, and first and second data selectors.
  • the first and second operand buses each carry a high operand and a low operand.
  • the first data selector supplies the high operand or the low operand from the first operand bus to the execution unit in response to a first operand select value contained in the instruction.
  • the second data selector supplies the high operand or the low operand from the second operand bus to the execution unit in response to a second operand select value contained in the instruction.
  • the execution unit may comprise an arithmetic logic unit, a multiplier and an accumulator.
  • the register file comprises first and second register banks, each having two read ports and two write ports. In another embodiment, the register file comprises a single register bank having four read ports and four write ports.
  • a computation unit comprises an execution unit for performing an operation on first and second operands in response to an instruction, a register file for storing operands, an operand bus coupled to the register file, the operand bus carrying a high operand and a low operand, and a data selector, responsive to an operand select value contained in the instruction, for supplying the high operand or the low operand from the operand bus to the execution unit.
  • A method is provided for performing a digital computation.
  • the method comprises the steps of storing operands for the computation in a register file, supplying operands from the register file on first and second operand buses, each carrying a high operand and a low operand, selecting the high operand or the low operand from the first operand bus in response to a first operand select value contained in an instruction and supplying a selected first operand to the execution unit, selecting the high operand or the low operand from the second operand bus in response to a second operand select value contained in the instruction and supplying a selected second operand to the execution unit, and performing an operation specified by the instruction on the operands selected from the first and second operand buses.
  • a digital signal processor computation unit comprises first and second execution units for performing operations in response to an instruction and for producing first and second results, a result register for storing the results of the operations, the result register having first and second locations, and result swapping logic, coupled between the first and second execution units and the result register, for swapping the first and second results between the first and second locations in the result register in response to result swapping information contained in the instruction.
  • the first and second execution units may comprise first and second arithmetic logic units for performing add and subtract operations.
  • the first and second execution units are separately controllable and may perform the same or different operations in response to operation code information contained in the instruction.
  • the first and second arithmetic logic units may comprise 16-bit arithmetic logic units which are configurable as a 32-bit arithmetic logic unit.
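One way to picture two 16-bit ALUs acting as a single 32-bit ALU is to chain the carry out of the low-half adder into the high-half adder. The sketch below is illustrative only: the source does not say how the two units are linked, so the carry chaining, the function name `add32_with_two_alu16` and the wrap-around behaviour are all assumptions.

```python
MASK16 = 0xFFFF


def add32_with_two_alu16(a: int, b: int) -> int:
    """Add two 32-bit words using two 16-bit additions (assumed carry chain)."""
    # Low-half ALU: add the low 16 bits of each word.
    low_sum = (a & MASK16) + (b & MASK16)
    carry = low_sum >> 16  # carry out of the low half, 0 or 1

    # High-half ALU: add the high 16 bits plus the carry from the low half.
    high_sum = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry

    # Recombine; any carry out of the high half is discarded (wrap-around).
    return ((high_sum & MASK16) << 16) | (low_sum & MASK16)
```

For example, `add32_with_two_alu16(0x0000FFFF, 0x00000001)` carries from the low half into the high half and gives `0x00010000`.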
  • the first and second locations in the result register may comprise high and low halves of the result register.
  • the result register may comprise a register in a register file. According to another aspect of the invention, a method is provided for performing digital signal computations.
  • the method comprises the steps of performing operations in first and second execution units in response to an instruction and producing first and second results, storing the results of the operations in a result register having first and second locations, and swapping the first and second results with respect to the first and second locations in the result register, in response to result swapping control information contained in the instruction.
  • a digital signal processor computation unit comprises first and second execution units for performing operations in response to an instruction and for producing first and second results, a result register for storing the results of the operations, the result register having first and second locations, and means for swapping the first and second results with respect to the first and second locations in the result register, in response to result swapping control information contained in the instruction.
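The result swapping described above can be sketched as a small function that places two 16-bit results into the high and low halves of a 32-bit result register. This is a minimal sketch, not the patented logic: the name `pack_results`, the `swap` flag and the default placement (first result in the high half) are assumptions made for illustration.

```python
MASK16 = 0xFFFF


def pack_results(result0: int, result1: int, swap: bool) -> int:
    """Pack two 16-bit results into a 32-bit result register.

    Assumption: without swapping, result0 lands in the high half and
    result1 in the low half. The source does not fix this convention.
    """
    r0 = result0 & MASK16
    r1 = result1 & MASK16
    # The swap bit from the instruction exchanges the two destinations.
    high, low = (r1, r0) if swap else (r0, r1)
    return (high << 16) | low
```

With `swap=False`, results `0x1234` and `0xABCD` pack to `0x1234ABCD`; with `swap=True` they pack to `0xABCD1234`.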
  • a digital signal processor computation core comprises first and second execution units for performing first and second operations in response to control signals, and control logic for providing the control signals to the first and second execution units in response to control information contained in an instruction for individually controlling the first and second operations.
  • the first and second execution units comprise first and second arithmetic logic units.
  • the first and second operations may be selected from add operations and subtract operations, and may be the same or different.
  • the computation core may further comprise a register file for storing operands and results of the first and second operations, and first and second operand buses coupled between the register file and the first and second execution units, each of the first and second operand buses carrying a high operand and a low operand, wherein the first execution unit performs the first operation on the high operands and the second execution unit performs the second operation on the low operands.
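As an illustration of individually controlled operations on the high and low operands, the following sketch models each operand bus as a 32-bit integer and applies an independently chosen add or subtract to each half. The function names, the boolean control flags and the wrap-around (non-saturating) arithmetic are assumptions, not details taken from the source.

```python
MASK16 = 0xFFFF


def alu16(a: int, b: int, subtract: bool) -> int:
    """One 16-bit ALU: add or subtract, wrapping to 16 bits (assumed)."""
    return ((a - b) if subtract else (a + b)) & MASK16


def dual_alu(bus_a: int, bus_b: int, sub_high: bool, sub_low: bool) -> int:
    """Apply separately controlled operations to the high and low halves."""
    # First execution unit works on the high operands of both buses.
    high = alu16((bus_a >> 16) & MASK16, (bus_b >> 16) & MASK16, sub_high)
    # Second execution unit works on the low operands of both buses.
    low = alu16(bus_a & MASK16, bus_b & MASK16, sub_low)
    return (high << 16) | low
```

For instance, `dual_alu(0x00050003, 0x00020001, False, True)` adds the high halves (5 + 2) and subtracts the low halves (3 - 1), giving `0x00070002`.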
  • a method is provided for performing digital signal computations. The method comprises the steps of performing first and second operations in first and second execution units, and individually controlling the first and second operations in response to control information contained in an instruction.
  • a digital signal processor computation core is provided.
  • the digital signal processor computation core comprises first and second execution units for performing first and second operations in response to control signals, and means responsive to control information contained in an instruction for providing the control signals to the first and second execution units for individually controlling the first and second operations, wherein the first and second operations may be the same or different.
  • a computation core is provided for executing programmed instructions.
  • the computation core comprises an execution block for performing digital signal processor operations in response to digital signal processor instructions and for performing microcontroller operations in response to microcontroller instructions, a register file for storing operands for and results of the digital signal processor operations and the microcontroller operations, and control logic for providing control signals to the execution block and the register file in response to the digital signal processor instructions and the microcontroller instructions for executing the digital signal processor instructions and the microcontroller instructions.
  • the digital signal processor instructions are configured for high efficiency digital signal computations.
  • the microcontroller instructions are configured for code storage density.
  • the microcontroller instructions have a 16-bit format and the digital signal processor instructions have a 32-bit format.
  • the digital signal processor instructions may contain information indicating whether one or more related instructions follow.
  • the related instructions may comprise load instructions.
  • A method is provided for executing programmed instructions.
  • the method comprises the steps of executing digital signal processor operations in an execution block in response to digital signal processor instructions configured for efficient digital signal computation, and executing microcontroller operations in the execution block in response to microcontroller instructions configured for code storage density.
  • An application program having a mixture of digital signal processor instructions and microcontroller instructions is characterized by high code storage density and efficient digital signal computation.
  • a digital signal processor having a pipeline structure comprises a computation block for executing computation instructions, the computation block having one or more computation stages of the pipeline structure, and a control block for fetching and decoding the computation instructions and for accessing a memory, the control block having one or more control stages of the pipeline structure.
  • the computation stages and the control stages are positioned in the pipeline structure such that a result of the memory access is available to the computation stages without stalling the computation stages.
  • the computation stages and the control stages may be positioned in the pipeline structure so as to avoid stalling the computation stages when a computation instruction immediately follows a memory access instruction and requires the result of the memory access instruction.
  • the computation stages and the control stages may be positioned in the pipeline structure such that the control block has one or more idle stages following completion of the memory access.
  • the computation stages and the control stages may be positioned in the pipeline structure such that the computation block has one or more idle stages prior to a first computation stage.
  • A method is provided for a digital signal computation.
  • the method comprises the steps of executing computation operations in a computation block having one or more computation stages, executing control operations, including fetching instructions, decoding instructions and accessing a memory, in a control block having one or more control stages, wherein the computation stages and the control stages are configured in a pipeline structure, and positioning the computation stages relative to the control stages in the pipeline structure such that a result of a memory access is available to the computation stages without stalling the computation stages.
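The effect of stage positioning on stalls can be illustrated with a toy timing model. It assumes one instruction issues per cycle and that an instruction issued at cycle t occupies pipeline stage s during cycle t + s. The stage numbers used here are hypothetical and do not come from the source.

```python
def load_use_stall(mem_done_stage: int, first_compute_stage: int) -> int:
    """Stall cycles for a compute instruction issued right after a load.

    mem_done_stage: stage in which the load's memory access completes.
    first_compute_stage: first stage in which a compute instruction
    needs its operands. Both stage indices are hypothetical.
    """
    # The load issued at cycle t has data usable from cycle
    # t + mem_done_stage + 1. The dependent instruction issued at t + 1
    # reaches its first compute stage at cycle t + 1 + first_compute_stage.
    return max(0, mem_done_stage - first_compute_stage)
```

In this model, placing the first computation stage after the memory access stage (for example `load_use_stall(4, 5)`) gives zero stall cycles, while overlapping them (`load_use_stall(4, 2)`) costs two.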
  • A method is provided for determining an output of a finite impulse response digital filter having L filter coefficients in response to a set of M input samples.
  • the method comprises the steps of (a) loading a first input sample into a first location in a first register, (b) loading a second input sample into a second location in the first register, (c) loading two coefficients into a second register, (d) computing intermediate results using the contents of the first and second registers, (e) loading a new input sample into the first location in the first register, (f) computing intermediate results using the contents of the first and second registers, (g) repeating steps (b)-(f) for L iterations to provide two output samples, and (h) repeating steps (a)-(g) for M/2 iterations to provide M output samples.
  • Step (d) may comprise a multiply accumulate operation on a first coefficient in the second register and the input sample in the first location in the first register, and a multiply accumulate operation on the first coefficient in the second register and the input sample in the second location in the first register.
  • Step (f) may comprise a multiply accumulate operation on a second coefficient in the second register and the input sample in the first location in the first register, and a multiply accumulate operation on the second coefficient in the second register and the input sample in the second location in the first register.
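Steps (a)-(h) can be sketched in software as follows. This is an illustrative model, not the pseudo-code of FIG. 16: `loc1` and `loc2` stand for the two locations of the first register, `acc0` and `acc1` are assumed accumulators for the two output samples, samples before the start of the input are taken as zero, and M and L are assumed even. Because each pass of the inner loop consumes two coefficients, the sketch loops L/2 times, which is a choice of this sketch.

```python
def fir_two_outputs_per_pass(x, c):
    """Compute y[n] = sum_k c[k] * x[n - k], two output samples at a time."""
    m, l = len(x), len(c)
    assert m % 2 == 0 and l % 2 == 0, "sketch assumes even M and L"

    def sample(j):
        # Samples outside the input are treated as zero (assumption).
        return x[j] if 0 <= j < m else 0

    y = []
    for n in range(0, m, 2):              # step (h): M/2 passes
        acc0 = acc1 = 0                   # acc0 -> y[n], acc1 -> y[n + 1]
        loc1 = sample(n + 1)              # step (a): first location
        for k in range(0, l, 2):          # step (g): repeat (b)-(f)
            loc2 = sample(n - k)          # step (b): second location
            c0, c1 = c[k], c[k + 1]       # step (c): two coefficients
            acc0 += c0 * loc2             # step (d): MACs with first coefficient
            acc1 += c0 * loc1
            loc1 = sample(n - k - 1)      # step (e): new sample, first location
            acc0 += c1 * loc1             # step (f): MACs with second coefficient
            acc1 += c1 * loc2
        y += [acc0, acc1]
    return y
```

Each sample loaded into a register location is used by two multiply accumulate operations before it is replaced, so only one new sample is loaded per coefficient. For `x = [1, 2, 3, 4]` and `c = [1, 1]` the function returns `[1, 3, 5, 7]`.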
  • FIG. 1 is a block diagram of a computation core in accordance with an embodiment of the invention;
  • FIG. 2 is a block diagram of a digital signal processor incorporating the computation core of FIG. 1;
  • FIG. 3 is a more detailed block diagram of a portion of the computation core, showing a first embodiment of the register file;
  • FIG. 3A is a more detailed block diagram of a portion of the computation core, showing a second embodiment of the register file;
  • FIG. 4 is a block diagram of the execution units shown in FIG. 1;
  • FIG. 4A is a more detailed block diagram of a portion of one of the execution units shown in FIG. 4;
  • FIG. 5 schematically illustrates an example of the operation of the multiplier/accumulators in the execution units;
  • FIGS. 6A-6D schematically illustrate examples of the selection of different operands by one of the multiplier/accumulators;
  • FIGS. 7A-7D schematically illustrate examples of dual 16-bit arithmetic logic unit (ALU) operations which may be performed by the ALUs in the execution units;
  • FIG. 7E schematically illustrates an example of a quad 16-bit ALU operation which may be performed by the ALUs in the execution units;
  • FIG. 8 schematically illustrates the swapping of results produced by the ALUs;
  • FIG. 9 schematically illustrates an example of a 32-bit DSP multiply accumulate instruction format that may be used in the computation core of FIG. 1;
  • FIG. 10 schematically illustrates an example of a 32-bit ALU instruction format that may be used in the computation core of FIG. 1;
  • FIG. 11 schematically illustrates an example of a 16-bit microcontroller instruction format that may be used in the computation core of FIG. 1;
  • FIG. 12 schematically illustrates the operation of the pipeline in the computation core of FIG. 1;
  • FIG. 13 schematically illustrates the operation of a prior art pipeline;
  • FIG. 14 is a block diagram that illustrates an embodiment of the pipeline structure in the computation core of FIG. 1;
  • FIGS. 15A-15C schematically illustrate the operation of an FIR digital filter algorithm that may run efficiently on the computation core of FIG. 1; and
  • FIG. 16 shows pseudo-code for an example of an FIR digital filter algorithm that may run efficiently on the computation core of FIG. 1.
  • A block diagram of an embodiment of a computation core 10 in accordance with the invention is shown in FIG. 1.
  • a block diagram of a digital signal processor 20 incorporating computation core 10 is shown in FIG. 2.
  • digital signal processor 20 is implemented as a monolithic integrated circuit which incorporates computation core 10.
  • Computation core 10 includes a computation block 24 and an addressing block 26 coupled through operand buses 30 and result buses 32 to a memory interface 34. Address buses 40 and 42 are coupled between addressing block 26 and memory interface 34. Computation core 10 further includes an instruction sequencer 50 coupled by an instruction address bus 52 and an instruction bus 54 to memory interface 34. Memory interface 34 is connected by memory buses 60 and 62 to a memory 64 (FIG. 2), including memory banks 70, 72, 74 and 76, located external to computation core 10.
  • Computation block 24 includes a register file 80 and execution units 82 and 84, each of which is connected to operand buses 30 and result buses 32.
  • Execution unit 82 (execution unit 0) includes an arithmetic logic unit (ALU) 90, a multiplier 92, an accumulator 94, and a shifter 96.
  • Execution unit 84 (execution unit 1) includes an ALU 100, a multiplier 102, and an accumulator 104. The structure and operation of computation block 24 are described in detail below.
  • the addressing block 26 includes an address register file 120 and data address generators 124.
  • address register file 120 has a capacity of 8 address values.
  • the address register file 120 may be used for microcontroller programs that require simple addressing, and may access different word widths (8-bit bytes, 16-bit half words, and 32-bit words).
  • the addressing block 26 may include four data address generators (DAGs) 124 for generating address sequences or patterns.
  • the addresses generated by addressing block 26 are supplied through address buses 40 and 42, memory interface 34 and memory buses 60 and 62 to memory 64 (FIG. 2).
  • Instruction sequencer 50 includes a loop buffer 130, an instruction decoder 132 and sequencer/control logic 134.
  • Instructions are received from memory 64 through one of the memory buses 60 or 62 and are delivered to the instruction sequencer 50 via instruction bus 54.
  • the instructions are temporarily stored in loop buffer 130.
  • the loop buffer 130 is used for implementing repetitive code sequences with no overhead.
  • the instructions are decoded in the instruction decoder 132 and are interpreted by the sequencer/control logic 134 to control operations by the rest of the computation core.
  • the integration of computation core 10 into digital signal processor 20 is shown in FIG. 2.
  • Core 10 is connected to the other elements of the digital signal processor 20 through memory buses 60 and 62.
  • the digital signal processor 20 may further include a memory bus 150, which is not connected to computation core 10 and an industry standard bus 152, also not connected to computation core 10.
  • Standard bus 152 may, for example, be a Peripheral Components Interconnect (PCI) bus and may be connected to memory buses 60, 62 and 150 through a peripheral bus bridge 154.
  • memory buses 60, 62 and 150 are connected to memory banks 70, 72, 74 and 76, peripheral bus bridge 154, a DMA controller 160 and an external memory bus controller 162.
  • the external memory bus controller 162 permits the digital signal processor 20 to be connected to an external memory via an external memory bus 164.
  • the standard bus 152 may be connected to a custom peripheral interface 170, a serial port 172, a microcontroller host port 174, an FPGA (field programmable gate array) based peripheral 176, a custom algorithm accelerator 178 and a purchased peripheral interface 180. It will be understood that different elements may be added to or removed from the digital signal processor 20 for different applications.
  • register file 80 has eight registers and is partitioned into register file banks 200 and 202, each having four registers of 32 bits each.
  • register file bank 200 contains registers R0-R3
  • register file bank 202 contains registers R4-R7. This arrangement results in low power because each four entry register file bank 200, 202 requires less energy per access than a single eight entry register file.
  • Each four entry register file bank 200, 202 requires two read ports and two write ports, while an eight entry register file requires four read ports and four write ports.
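The partitioning of R0-R7 into two four-entry banks can be expressed as a simple lookup. The sketch below returns the bank's reference numeral from the description and the register's position within that bank; the function name and the tuple return format are assumptions for illustration.

```python
def bank_for_register(reg_index: int) -> tuple:
    """Map register Rn (n = 0..7) to (bank reference numeral, index in bank)."""
    if not 0 <= reg_index <= 7:
        raise ValueError("register file 80 holds R0-R7 only")
    # R0-R3 live in register file bank 200, R4-R7 in bank 202.
    bank = 200 if reg_index < 4 else 202
    return bank, reg_index % 4
```

For example, `bank_for_register(5)` returns `(202, 1)`, meaning R5 is the second entry of bank 202.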
  • Register file 80 is connected to execution units 82 and 84 and to memory 64 by operand buses 30 and result buses 32.
  • Operand buses 30 include operand bus 210, operand bus 212, operand bus 214 and operand bus 216.
  • Operand buses 210 and 212 are connected between register file banks 200 and 202 and memory 64 for writing results of computations to memory. In another embodiment, a single operand bus may be used for writing data from register file 80 to memory 64.
  • Operand buses 214 and 216 are connected between register file banks 200 and 202 and execution units 82 and 84 for supplying operands to execution units 82 and 84.
  • Result buses 32 include result bus 220, result bus 222, result bus 224 and result bus 226.
  • Result buses 220 and 222 are connected between memory 64 and register file banks 200 and 202 for reading operands from memory 64.
  • Result buses 224 and 226 are connected between execution units 82 and 84 and register file banks 200 and 202 for writing results of computations in register file 80.
  • each of operand buses 210, 212, 214 and 216 and each of result buses 220, 222, 224 and 226 is 32 bits wide.
  • memory 64 is external to computation core 10.
  • the connections to memory 64 are via memory interface 34 and memory buses 60 and 62, as described above in connection with FIGS. 1 and 2.
  • a block diagram of a second embodiment of register file 80, execution units 82 and 84 and memory 64, and the interconnection between these elements, is shown in FIG.
  • register file 80 has a single register file bank 240 having eight registers, R0-R7, of 32 bits each.
  • Register file bank 240 has four read ports and four write ports.
  • a block diagram of execution units 82 and 84 is shown in FIG. 4.
  • a portion of execution unit 82 is shown in more detail in FIG. 4A.
  • Execution unit 82 includes a multiplier array 250, an ALU 252, an accumulator 254 and a barrel shifter 256.
  • Execution unit 84 includes a multiplier array 260, an ALU 262 and an accumulator 264.
  • Each multiplier array 250, 260 receives two 16-bit operands and provides two 32-bit outputs to the respective ALUs 252, 262.
  • ALUs 252 and 262 may also receive two 32-bit inputs from operand buses 214 and 216.
  • ALUs 252 and 262 are 40-bit ALUs.
  • the output of ALU 252 is connected to accumulator 254 and is connected through a result swap mux (multiplexer) 280 to one input of an output select mux 282.
  • the output of accumulator 254 is connected to a second input of output select mux 282 and is connected to an input of ALU 252.
  • the output of ALU 262 is connected to accumulator 264 and is connected through a result swap mux 284 to an output select mux 286.
  • the output of accumulator 264 is connected to a second input of output select mux 286 and to an input of ALU 262.
  • the output of output select mux 282 is connected to result bus 226, and the output of output select mux 286 is connected to result bus 224.
  • For multiply operations, multiplier arrays 250 and 260 and the ALUs 252 and 262 are utilized.
  • For multiply accumulate (MAC) operations, multiplier arrays 250 and 260, ALUs 252 and 262 and accumulators 254 and 264 are utilized.
  • For add/subtract operations, ALUs 252 and 262 are utilized. The appropriate outputs are selected by output select muxes 282 and 286 and are supplied on result buses 226 and 224 to register file 80.
  • the operations of the execution units 82 and 84 are described in more detail below.
  • FIG. 4 shows only the data paths in the execution units 82 and 84.
  • Each element of execution units 82 and 84 receives control signals from the sequencer/control logic 134 (FIG. 1) for controlling operations in accordance with instructions being executed.
  • Each of the operand buses 214 and 216 is 32 bits wide and carries two 16-bit operands, a high operand and a low operand.
  • the computation block 24 is preferably provided with an arrangement of data selectors which permits the multiplier in each of execution units 82 and 84 to select the high or low operand from each of the operand buses 214 and 216.
  • a mux (data selector) 300 selects the high operand or the low operand from operand bus 214 for input to multiplier array 250
  • a mux 302 selects the high operand or the low operand from operand bus 216 for input to multiplier array 250.
  • a mux 310 selects the high operand or the low operand from operand bus 214 for input to multiplier array 260
  • a mux 312 selects the high operand or the low operand from operand bus 216 for input to multiplier array 260.
  • the select inputs to muxes 300, 302, 310 and 312 are controlled in response to information contained in instructions as described below. This arrangement for selecting operands provides a high degree of flexibility in performing digital signal computations.
  • a schematic representation of a dual multiply accumulate operation by execution units 82 and 84 is shown in FIG. 5. Like elements in FIGS. 4 and 5 have the same reference numerals.
  • a 32-bit data element 340 represents the operands supplied from register file 80 on operand bus 214, and a 32-bit data element 342 represents the operands supplied from register file 80 on operand bus 216.
  • a 40-bit data element 344 represents the contents of accumulator 254, and a 40-bit data element 346 represents the contents of accumulator 264.
  • Multiplier array 250 receives the low operands from data elements 340 and 342 and supplies an output to ALU 252.
  • ALU 252 adds the output of multiplier array 250 and data element 344 and places the result in accumulator 254 as a new data element 344.
  • multiplier array 260 receives the high operands from data elements 340 and 342 and supplies an output to ALU 262.
  • ALU 262 adds the output of multiplier array 260 and data element 346 from accumulator 264 and places the result in accumulator 264 as a new data element 346.
  • muxes 300 and 302 select the low operands from operand buses 214 and 216 and supply the low operands to multiplier array 250.
  • Muxes 310 and 312 select the high operands from operand buses 214 and 216 and supply the high operands to multiplier array 260.
  • Selection of different operands for computation by execution unit 82 is illustrated in the schematic representations of FIGS. 6A-6D. Like elements in FIGS. 4, 5 and 6A-6D have the same reference numerals.
  • In FIG. 6A, the low operand of data element 340 and the low operand of data element 342 are supplied to multiplier array 250.
  • In FIG. 6B, the high operand of data element 340 and the low operand of data element 342 are supplied to multiplier array 250.
  • In FIG. 6C, the low operand of data element 340 and the high operand of data element 342 are supplied to multiplier array 250.
  • In FIG. 6D, the high operand of data element 340 and the high operand of data element 342 are supplied to multiplier array 250.
  • the data element 340 appears on operand bus 214 (FIG. 4), and the data element 342 appears on operand bus 216.
  • the selection of operands for multiplier array 250 is made by muxes 300 and 302, as shown in FIG. 4.
  • muxes 310 and 312 perform operand selection for multiplier array 260.
  • the muxes 300, 302, 310 and 312 are controlled by select signals derived from instructions being executed, as described below.
  • the operand selection technique is described above in connection with dual multiply accumulate (MAC) units. However, since this technique relates to the data movement and selection aspects of computation, it is generally applicable to data selection for any execution unit that performs any arbitrary arithmetic operation. In addition, although the description relates to selection of one of two 16-bit operands, the operand selection technique can be implemented with operands of any width and with two or more operands. When using the operand selection technique, the programmer selects two pairs of adjacent 16-bit data elements that reside in register file 80.
  • the programmer selects a high or low 16-bit operand from a 32-bit data element to serve as one input to one of the MACs.
  • the other input to the same MAC is a high or low 16-bit operand selected from the other operand bus.
  • the execution units 82 and 84 also execute instructions which specify ALU operations, i.e., operations which involve addition or subtraction and which do not require the multiplier array or the accumulator.
  • the ALUs 252 and 262 may be configured for performing various ALU operations. In most cases, only one of the ALUs 252 and 262 is active in performing ALU operations. An exception is shown in FIG.
  • ALU operations are described in connection with ALU 252 and execution unit 82. It will be understood that the same ALU operations can be performed by ALU 262 in execution unit 84.
  • ALU 252 performs a 32-bit add or subtract and outputs a 32-bit result through result swap mux 280 and output select mux 282 to result bus 226.
  • the ALU 252 may be configured for performing two 16-bit addition or subtraction operations, as illustrated in FIGS. 7A-7D.
  • 32-bit ALU 252 is configured to function as two 16-bit ALUs 360 and 362 (FIG. 4A).
  • a 32-bit ALU may be configured as two independent 16-bit ALUs by blocking the carry from bit 15 into bit 16.
  • ALU 360 adds the high operands of data elements 340 and 342 and places the 16-bit result in a high result portion of a data element 364.
  • ALU 362 adds the low operands of data elements 340 and 342 and places the result in a low result portion of data element 364.
  • the 32-bit data element 364 is supplied on result bus 226 to register file 80.
  • FIGS. 7A, 7B, 7C and 7D illustrate the fact that 16-bit ALUs 360 and 362 are separately programmable in response to control 1 and control 0 signals (FIG. 4A), and may perform the same or different operations.
  • FIG. 7A illustrates the case where ALU 360 and ALU 362 both perform add operations.
  • FIG. 7B illustrates the case where ALU 360 and ALU 362 both perform subtract operations.
  • FIG. 7C illustrates the case where ALU 360 performs an add operation and ALU 362 performs a subtract operation.
  • FIG. 7D illustrates the case where ALU 360 performs a subtract operation and ALU 362 performs an add operation.
  • the control 1 and control 0 signals are supplied from instruction decoder 132 (FIG.
  • ALU operations typically utilize only one of the execution units 82 and 84.
  • An exception is described with reference to FIG. 7E.
  • the 16-bit ALU 360 subtracts the high operands of data elements 340 and 342 and places the result in the high result portion of data word 364.
  • the 16-bit ALU 362 adds the low operands of data elements 340 and 342 and places the result in the low result portion of data element 364.
  • This configuration further utilizes 32-bit ALU 262 in execution unit 84 configured as 16-bit ALUs 370 and 372.
  • the 16-bit ALU 370 adds the high operands of data elements 340 and 342 and places the result in a high result portion of a data element 374.
  • the 16-bit ALU 372 subtracts the low operands of data elements 340 and 342 and places the result in a low result portion of data element 374.
  • Data element 374 is supplied on result bus 224 to register file 80.
  • execution units 82 and 84 simultaneously perform four 16-bit ALU operations.
  • FIGS. 7A-7D illustrate a configuration where 16-bit ALU 360 and 16-bit ALU 362 are separately programmable and the operations performed by ALUs 360 and 362 may be the same or different.
  • an ALU instruction includes operation fields which specify the individual operations to be performed by ALUs 360 and 362. This individual control feature is generally applicable to any execution units that perform two or more operations simultaneously.
  • the multiplier accumulators in execution units 82 and 84 are individually controllable and may perform the same or different operations.
  • a multiplier accumulator instruction includes operation fields which individually specify the operations to be performed by execution units 82 and 84.
  • the individual control feature can be implemented with execution units of any type or width, and with two or more execution units, or with a single execution unit having two or more computation devices.
  • a further feature of execution units 82 and 84 is described with reference to FIGS. 4A and 8.
  • the results generated by 16-bit ALUs 360 and 362 may be reversed, or swapped, with respect to their placement in 32-bit data element 364. Specifically, the output of ALU 360 is supplied to the low result portion of data element 364, and the output of ALU 362 is supplied to the high result portion of data element 364. This reversed or swapped configuration is contrasted with the configuration of FIGS.
  • result swap mux 280 may include a mux 380, which is controlled by a swap signal, and a mux 382, which is controlled by an inverted swap signal.
  • Each of the muxes 380 and 382 receives the 16-bit results from ALUs 360 and 362.
  • When the swap signal is not asserted, the output of ALU 360 is supplied to the high result portion of result bus 226, and the output of ALU 362 is supplied to the low result portion of result bus 226.
  • When the swap signal is asserted, the output of ALU 360 is supplied to the low result portion of result bus 226, and the output of ALU 362 is supplied to the high result portion of result bus 226, thereby swapping the outputs of ALUs 360 and 362.
  • output select mux 282 (FIG. 4) is omitted for simplicity of illustration.
  • the result swapping technique is described above in connection with swapping of ALU outputs. However, since this technique relates to the data movement aspects of computation, it is generally applicable to result swapping for any execution unit that produces two or more results. As described below in connection with FIG. 10, an ALU instruction includes a field which specifies whether or not the results of the ALU operations are to be swapped. The result swapping technique can be implemented with results of any width and with two or more results.
  • a multiplier accumulator instruction 400 has a 32-bit format, with the fields of the instructions as shown in FIG. 9.
  • Source fields, src0 and src1, each having three bits, identify the registers in register file 80 which are to provide the operands for the computation on operand buses 214 and 216 (FIG. 4).
  • a three bit destination field, dst, identifies the register in register file 80 where the result of the computation is to be stored.
  • Operation fields, op0 and op1, each having two bits, indicate the operations to be performed by execution units 82 and 84, respectively.
  • the operations include multiply, multiply-add, multiply-subtract and no operation.
  • the multiply-add and multiply-subtract operations are MAC operations.
  • a P field indicates whether the result is to be written to a single register or written to a register pair.
  • Two w fields, w1 and w0, indicate whether the result is to be accumulated only or accumulated and written to a register.
  • the w1 field applies to execution unit 82, and the w0 field applies to execution unit 84.
  • An h00 field indicates whether to select the high operand or the low operand of source 0 (src0) for execution unit 82.
  • An h10 field indicates whether to select the high operand or the low operand of source 1 (src1) for execution unit 82.
  • An h01 field indicates whether to select the high operand or the low operand of source 0 for execution unit 84.
  • An h11 field indicates whether to select the high operand or the low operand of source 1 for execution unit 84.
  • the h00 and h10 fields control muxes 300 and 302, respectively, at the inputs to execution unit 82, and the h01 and h11 fields control muxes 310 and 312, respectively, at the inputs to execution unit 84.
  • An MM field indicates whether or not execution unit 84 is in mixed mode (signed/unsigned).
  • An mmod field indicates fraction or integer operation, signed or unsigned operation, round or truncate operation and scaled or unscaled operation.
  • An M field indicates whether or not two load/store instructions follow the instruction.
  • An example of a DSP type ALU instruction format for controlling execution units 82 and 84 to perform ALU operations is shown in FIG. 10.
  • An ALU instruction 450 has a 32-bit format. As in the case of the multiply accumulate instruction, the M field indicates whether or not two load/store instructions follow the instruction.
  • An operation code field, aopcde, is used in conjunction with a secondary op code field, aop, to specify a particular arithmetic operation. Examples include single 16-bit ALU operations, single 32-bit ALU operations, dual 16-bit ALU operations and quad 16-bit ALU operations, as well as other arithmetic operations known to those skilled in the art.
  • Source fields, src0 and src1, each having 3 bits, specify the source registers in register file 80 containing 32-bit data elements for the computation.
  • An HL field indicates whether the result of a single ALU operation is to be deposited in the high half or the low half of the destination register.
  • An x field indicates whether or not two 16-bit results are to be swapped as they are deposited in the destination register. The value contained in the x field controls the operation of result swap mux 280 (FIG. 4A) as described above.
  • An s field determines whether saturation is active or inactive.
  • the aop field indicates the two operands that are to be added or subtracted, i.e., low and low; low and high; high and low; or high and high.
  • the HL field indicates whether the 16-bit result is to be deposited in the high or low half of the destination register.
  • the aop field indicates the two operations to be performed by the two 16-bit ALUs, i.e., add/add; add/subtract; subtract/add; or subtract/subtract.
  • the aop field controls the individual operations performed by ALUs 360 and 362 (see FIGS. 7A-7D).
  • the aop field controls the operations performed by 16-bit ALUs 360, 362, 370 and 372 (FIG. 7E).
  • the possible operations are add/add for one execution unit and subtract/subtract for the other execution unit, or add/subtract for one execution unit and subtract/add for the other execution unit, to avoid redundant calculations.
  • the aopcde field in instruction 450 may also specify a 32-bit add or subtract operation.
  • The instructions shown in FIGS. 9 and 10 and described above are DSP instructions. These instructions are characterized by a high degree of flexibility and include optional features to permit efficient digital signal processor computations.
  • An example of a microcontroller type instruction format for controlling execution units 82 and 84 to perform arithmetic operations is shown in FIG. 11.
  • An instruction 480 has a length of 16 bits and contains only three fields, a 4-bit operation code field, opc, a 3-bit source field, src, and a 3-bit destination field, dst.
  • the input operands are taken from the registers in register file 80 specified by the src and dst fields.
  • the result of the computation is placed in the register specified by the dst field, thereby overwriting one of the operands.
  • the operation code field, opc, may specify add, subtract, multiply, as well as other arithmetic operations known to those skilled in the art. It may be observed that instruction 480 is relatively simple and has only three fields that may be specified by the programmer. However, because instruction 480 has a length of 16 bits, it occupies only half of the memory space that is occupied by the more complex DSP instructions described above. As described above, code density is an important factor in microcontroller applications. A typical microcontroller application may have a relatively large number of instructions requiring relatively simple computations and data handling. Because the number of instructions in a microcontroller application may be large, code density is an important factor in minimizing memory requirements.
  • DSP applications typically include a relatively small number of instructions which may be executed repetitively in performing DSP computations.
  • code density is less important than efficient execution in achieving high performance in DSP applications.
  • microcontroller and DSP functions may be combined efficiently in a single computation core.
  • a combined application typically includes a relatively large number of 16-bit microcontroller instructions and a relatively small number of 32-bit DSP instructions, thereby achieving a high degree of code density.
  • the relatively small number of DSP instructions can be optimized for the highest performance in executing DSP computations.
  • the computation core 10 preferably has a pipeline architecture, as illustrated in FIGS. 12 and 14.
  • the pipeline has eight stages.
  • each stage performs a specified function of instruction execution, permitting multiple instructions to be executed simultaneously, with each instruction having a different phase of execution.
  • FIG. 12 is a pipeline timing diagram wherein a horizontal row of blocks represents the functions performed by the different stages of the pipeline in executing a single instruction.
  • row 500 represents execution of a first instruction
  • row 502 represents execution of a second instruction.
  • Vertically aligned blocks represent functions that are performed simultaneously by different stages in the pipeline.
  • stages 0 and 1 perform instruction fetch (IF) from an instruction cache 510 (FIG. 14).
  • Stage 2 performs instruction decoding (ID) in instruction decoder 132.
  • Stage 3 performs data address generation (DAG) in DAG 124.
  • Stages 4 and 5 perform data memory access (M1 and M2) in memory 64.
  • the instruction fetch, instruction decode, data address generation and memory access functions are performed by the control section of computation core 10, including instruction sequencer 50 and addressing block 26 (FIG. 1).
  • Stages 4-7 include operations performed by computation block 24.
  • Stage 4 performs register file read (RFR) from register file 80.
  • Stages 5 and 6 perform multiply accumulate operations (MAC1 and MAC2) in execution units 82 and 84.
  • the MAC1 operation of stage 5 is executed by multiplier arrays 250 and 260.
  • the MAC2 operation of stage 6 is executed by ALUs 252 and 262 and accumulators 254 and 264.
  • Arithmetic logic and shift operations (EX) of stage 6 are executed by ALUs 252 and 262 or barrel shifter 270.
  • the stage 7 operation is a register file write (RFW) from execution units 82 and 84 to register file 80.
  • pipeline stages are separated by latches 508 controlled by a system clock, as known in the art.
  • the pipeline shown in FIGS. 12 and 14 and described above is optimized for achieving high performance when executing DSP code.
  • a feature of the pipeline is that memory access operations (M1 and M2), such as loads and stores, occur early in the pipeline relative to the computation operations (EX, MAC1 and MAC2), thus achieving early memory access.
  • In FIG. 12, this is illustrated by the arrow from the end of the second memory access stage (M2) in row 500 to the beginning of the first computation stage (MAC1) in row 502.
  • the arrow represents a register file bypass operation wherein data loaded from memory is supplied directly to execution units 82 and 84, and register file 80 is bypassed.
  • MAC multiply accumulate
  • the memory access operations (DAG, M1, and M2) occur relatively early in the pipeline and result in two idle pipeline stages, stages 6 and 7, in the control section of the computation core.
  • the computation operations (MAC1 and MAC2) occur relatively late in the pipeline and result in one idle stage (DAG), stage 3, in the computation block 24 of the computation core.
  • A timing diagram for a conventional pipeline is illustrated in FIG. 13. As shown, memory access operations (DAG, M1, and M2) occur late in the pipeline relative to the computation operations (MAC1 and MAC2). In particular, memory access operations (M1 and M2) and computation operations (MAC1 and MAC2) both occur in stages 4 and 5. As a result, a one cycle stall is required between a load instruction and a computation instruction that immediately follows the load instruction. The stall may have a significant impact on performance where the sequence of instructions is contained in a loop that is executed multiple times. By contrast, the pipeline structure shown in FIGS. 12 and 14 does not require a stall between a load instruction and a computation instruction.
  • the early memory access pipeline structure shown in FIG. 12 has advantages in comparison with the prior art pipeline structure shown in FIG. 13. Load-to-use latencies in processors with execution units that have multiple pipeline stages are eliminated. Normally, processors with this type of execution unit suffer from load-to-use latencies. Elimination of load-to-use latencies results in simpler software that does not require loop unrolling or software pipelining, which are software techniques used to improve performance in processors with load-to-use latencies. Even when these techniques are applied, the performance of a conventional processor may be lower than that of the pipeline structure shown in FIGS. 12 and 14 and described above.
  • FIR filter may be defined mathematically as
  • x (n) are samples of an input signal
  • c(k) are L filter coefficients
  • z(n) are output signal samples.
  • Each output z(n) is obtained by computing the vector product of L samples of the input signal x(n) times L filter coefficients c(k) and summing the products. All signals and coefficients are 16-bit data values in this example.
  • the dual multiply accumulate operations shown in FIGS. 5 and 6A-6D and described above, may be utilized to perform FIR filter computations.
  • execution units 82 and 84 may be utilized to perform two multiply accumulate operations simultaneously.
  • a conventional implementation of an FIR filter on a DSP with dual execution units would require that a total of four data values be loaded from memory: two input values from x(n) and two filter coefficients from c(n). These data loads are achieved by loading a pair of adjacent data values and a pair of adjacent filter coefficient values.
  • a problem with this technique is that for half of the total number of memory accesses, the pairs of data values must come from locations that are not 32-bit aligned in memory.
  • the memory must be able to deliver data elements x(0) and x(1) into a register in an aligned 32-bit access, and must also be able to deliver data elements x(1) and x(2) to a register in a misaligned 32-bit access.
  • the data elements x(n) or the coefficients c(n) must be accessed as misaligned 32-bit element pairs, but not both.
  • One of these signals may always be accessed as 32-bit aligned pairs, and here it is assumed that coefficients c(n) are accessed as aligned 32-bit pairs.
  • the delivery of misaligned 32-bit element pairs in prior art systems requires two memory accesses and, therefore, is relatively inefficient.
  • a novel FIR filter implementation avoids misaligned 32-bit data accesses as follows. Let execution unit 82 (MAC0) compute all of the even indexed outputs and execution unit 84 (MAC1) compute all of the odd indexed outputs. For example, outputs z(0) through z(3) may be computed as follows.
  • execution unit 82 computes z(0) and z(2)
  • execution unit 84 computes z(1) and z(3).
  • the value z(0) is computed in execution unit 82, and the value z(1) is computed in execution unit 84.
  • coefficient pair c(2) and c(3) is loaded into register R1 and a single data sample x(3) is loaded into the high half of register R0, as shown in FIG. 15C.
  • the low half of register R0 is not changed.
  • the two multiply accumulate computations are performed as follows.
  • A pseudo-code representation of an algorithm for performing FIR digital filter computations as described above is shown in FIG. 16.
  • the algorithm includes an outer loop and an inner loop.
  • the outer loop is executed M/2 times, where M is the number of input data samples in the data set. Since two output values are computed on each pass of the outer loop, M/2 iterations are required.
  • a 16-bit data element x(0) is loaded into register RL0, the lower half of register R0, and the inner loop is executed L times, where L is the number of coefficients in the FIR filter.
  • the inner loop performs the multiply accumulate operations for values of an index variable k for values of k from 0 to L-1.
  • a 16-bit data element x(n+k+1) is loaded into register RH0, the high half of register R0.
  • Two 16-bit coefficients c(k+1) and c(k) are loaded into register R1.
  • the multiply accumulate value z(n+1) is computed in execution unit 84, and the result is stored in accumulator A1.
  • the multiply accumulate value z(n) is computed in execution unit 82, and the result is stored in accumulator A0.
  • a 16-bit data element x(n+k+2) is loaded into register RL0, the low half of register R0, and the multiply accumulate values z(n+1) and z(n) are computed.
  • the inner loop is executed L times. While there have been shown and described what are at present considered the preferred embodiments of the present invention, it will be obvious to those skilled in the art that various changes and modifications may be made therein without departing from the scope of the invention as defined by the appended claims.
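The FIR scheme described in the items above can be sketched in Python as a rough behavioral model. The sketch assumes the filter sum takes the form z(n) = sum over k of c(k)·x(n+k), which is inferred from the indices x(n+k+1) and x(n+k+2) used in the loop description rather than stated in this text. It also assumes an even number of coefficients and advances k by two per inner pass, since each pass consumes one coefficient pair. The function name `fir_dual_mac` and the variable names are hypothetical, and the 40-bit accumulator width is not modeled.

```python
def fir_dual_mac(x, c):
    """Behavioral sketch: two FIR outputs per outer pass, one new sample per load.

    Assumes z(n) = sum_k c(k) * x(n + k) and an even coefficient count.
    a0 stands in for accumulator A0 (execution unit 82, even outputs);
    a1 stands in for accumulator A1 (execution unit 84, odd outputs).
    """
    L = len(c)
    if L % 2:
        raise ValueError("sketch assumes an even number of coefficients")
    M = len(x) - L + 1      # outputs that can be formed from the samples
    M -= M % 2              # two outputs are produced per outer pass
    z = []
    for n in range(0, M, 2):
        a0 = a1 = 0
        rl0 = x[n]                      # low half of R0
        for k in range(0, L, 2):
            rh0 = x[n + k + 1]          # single sample into high half of R0
            rl1, rh1 = c[k], c[k + 1]   # aligned coefficient pair into R1
            a1 += rh0 * rl1             # z(n+1) += x(n+k+1) * c(k)
            a0 += rl0 * rl1             # z(n)   += x(n+k)   * c(k)
            rl0 = x[n + k + 2]          # single sample into low half of R0
            a1 += rl0 * rh1             # z(n+1) += x(n+k+2) * c(k+1)
            a0 += rh0 * rh1             # z(n)   += x(n+k+1) * c(k+1)
        z += [a0, a1]
    return z
```

Each sample is loaded individually into one half of the register while the coefficient pair is always loaded as an aligned pair, so no misaligned 32-bit access is modeled.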

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Advance Control (AREA)
  • Complex Calculations (AREA)
  • Executing Machine-Instructions (AREA)
  • Image Processing (AREA)
  • Filters That Use Time-Delay Elements (AREA)
  • Error Detection And Correction (AREA)

Abstract

A computation core includes a computation block, an addressing block and an instruction sequencer, which are coupled to a memory through a memory interface. The computation block includes a register file and dual execution units. The execution units include features for enhanced performance in executing digital signal computations. The computation core is configured for executing digital signal processor instructions and microcontroller instructions, while achieving efficient digital signal processor computation and high code density. A finite impulse response filter algorithm achieves high performance on the dual execution units.

Description

DIGITAL SIGNAL PROCESSOR COMPUTATION CORE
Field of the Invention
This invention relates to digital signal processors and, more particularly, to digital signal processor computation core architectures that facilitate complex digital signal processing computations.
Background of the Invention
A digital signal computer, or digital signal processor (DSP), is a special purpose computer that is designed to optimize performance for digital signal processing applications, such as, for example, Fast Fourier transforms, digital filters, image processing and speech recognition. Digital signal processor applications are typically characterized by real-time operation, high interrupt rates and intensive numeric computations. In addition, digital signal processor applications tend to be intensive in memory access operations and to require the input and output of large quantities of data. Digital signal processor architectures are typically optimized for performing such computations efficiently.
Microcontrollers, by contrast, involve the handling of data but typically do not require extensive computation. Microcontroller application programs tend to be longer than DSP programs. In order to limit the memory requirements of microcontroller application programs, it is desirable to provide a high degree of code density in such programs. Thus, architectures that are optimized for DSP computations typically do not operate efficiently as microcontrollers. Also, microcontrollers typically do not perform well as digital signal processors. Nonetheless, a particular application may require both digital signal processor and microcontroller functionality.
Digital signal processor designs may be optimized with respect to different operating parameters, such as computation speed and power consumption, depending on intended applications. Furthermore, digital signal processors may be designed for 16-bit words, 32-bit words, or other word sizes. A 32-bit architecture that achieves very high operating speed is disclosed in U.S. Patent No. 5,954,811 issued September 21, 1999 to Garde.
Digital signal processors frequently utilize architectures wherein two or more data words are stored in each row of memory, and two or more data words are provided in parallel to the computation unit. Such architectures provide enhanced performance, because several instructions and/or operands may be accessed simultaneously.
Notwithstanding the performance levels of current digital signal processors, there is a need for further enhancements in digital signal processor performance.
Summary of the Invention
According to a first aspect of the invention, a computation unit is provided. The computation unit is preferably configured for performing digital signal processor computations. The computation unit comprises an execution unit for performing an operation on a first operand and a second operand in response to an instruction, a register file for storing operands, first and second operand buses coupled to the register file, and first and second data selectors. The first and second operand buses each carry a high operand and a low operand. The first data selector supplies the high operand or the low operand from the first operand bus to the execution unit in response to a first operand select value contained in the instruction. The second data selector supplies the high operand or the low operand from the second operand bus to the execution unit in response to a second operand select value contained in the instruction. The execution unit may comprise an arithmetic logic unit, a multiplier and an accumulator. In one embodiment, the register file comprises first and second register banks, each having two read ports and two write ports. In another embodiment, the register file comprises a single register bank having four read ports and four write ports.
According to another aspect of the invention, a computation unit is provided. The computation unit comprises an execution unit for performing an operation on first and second operands in response to an instruction, a register file for storing operands, an operand bus coupled to the register file, the operand bus carrying a high operand and a low operand, and a data selector, responsive to an operand select value contained in the instruction, for supplying the high operand or the low operand from the operand bus to the execution unit.
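The operand selection described in the two aspects above can be illustrated with a short Python sketch. It assumes 32-bit bus words holding two 16-bit operands and treats the operands as signed two's complement values, which is an assumption for illustration. The function names `select_operand`, `to_signed16` and `multiply_selected` are hypothetical and are not taken from the patent.

```python
def select_operand(bus_word, select_high):
    """Model of a data selector: pick the high or low 16-bit half of a 32-bit bus word."""
    return (bus_word >> 16) & 0xFFFF if select_high else bus_word & 0xFFFF


def to_signed16(value):
    """Interpret a 16-bit pattern as a signed two's complement integer (assumed format)."""
    return value - 0x10000 if value & 0x8000 else value


def multiply_selected(bus0, bus1, select0_high, select1_high):
    """Multiply the operands chosen by the two operand select values."""
    a = to_signed16(select_operand(bus0, select0_high))
    b = to_signed16(select_operand(bus1, select1_high))
    return a * b
```

With two select values there are four possible pairings (low/low, high/low, low/high, high/high), all reachable without moving data between registers first.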
According to another aspect of the invention, a method is provided for performing a digital computation. The method comprises the steps of storing operands for the computation in a register file, supplying operands from the register file on first and second operand buses, each carrying a high operand and a low operand, selecting the high operand or the low operand from the first operand bus in response to a first operand select value contained in an instruction and supplying a selected first operand to the execution unit, selecting the high operand or the low operand from the second operand bus in response to a second operand select value contained in the instruction and supplying a selected second operand to the execution unit, and performing an operation specified by the instruction on the operands selected from the first and second operand buses.
According to another aspect of the invention, a digital signal processor computation unit is provided. The digital signal processor computation unit comprises first and second execution units for performing operations in response to an instruction and for producing first and second results, a result register for storing the results of the operations, the result register having first and second locations, and result swapping logic, coupled between the first and second execution units and the result register, for swapping the first and second results between the first and second locations in the result register in response to result swapping information contained in the instruction.
The first and second execution units may comprise first and second arithmetic logic units for performing add and subtract operations. The first and second execution units are separately controllable and may perform the same or different operations in response to operation code information contained in the instruction. The first and second arithmetic logic units may comprise 16-bit arithmetic logic units which are configurable as a 32-bit arithmetic logic unit. The first and second locations in the result register may comprise high and low halves of the result register. The result register may comprise a register in a register file. According to another aspect of the invention, a method is provided for performing digital signal computations. The method comprises the steps of performing operations in first and second execution units in response to an instruction and producing first and second results, storing the results of the operations in a result register having first and second locations, and swapping the first and second results with respect to the first and second locations in the result register, in response to result swapping control information contained in the instruction.
According to another aspect of the invention, a digital signal processor computation unit is provided. The digital signal processor computation unit comprises first and second execution units for performing operations in response to an instruction and for producing first and second results, a result register for storing the results of the operations, the result register having first and second locations, and means for swapping the first and second results with respect to the first and second locations in the result register, in response to result swapping control information contained in the instruction.
According to another aspect of the invention, a digital signal processor computation core is provided. The digital signal processor computation core comprises first and second execution units for performing first and second operations in response to control signals, and control logic for providing the control signals to the first and second execution units in response to control information contained in an instruction for individually controlling the first and second operations. In one example, the first and second execution units comprise first and second arithmetic logic units. The first and second operations may be selected from add operations and subtract operations, and may be the same or different.
The computation core may further comprise a register file for storing operands and results of the first and second operations, and first and second operand buses coupled between the register file and the first and second execution units, each of the first and second operand buses carrying a high operand and a low operand, wherein the first execution unit performs the first operation on the high operands and the second execution unit performs the second operation on the low operands.
According to another aspect of the invention, a method is provided for performing digital signal computations. The method comprises the steps of performing first and second operations in first and second execution units, and individually controlling the first and second operations in response to control information contained in an instruction.
According to a further aspect of the invention, a digital signal processor computation core is provided. The digital signal processor computation core comprises first and second execution units for performing first and second operations in response to control signals, and means responsive to control information contained in an instruction for providing the control signals to the first and second execution units for individually controlling the first and second operations, wherein the first and second operations may be the same or different.
According to a further aspect of the invention, a computation core is provided for executing programmed instructions. The computation core comprises an execution block for performing digital signal processor operations in response to digital signal processor instructions and for performing microcontroller operations in response to microcontroller instructions, a register file for storing operands for and results of the digital signal processor operations and the microcontroller operations, and control logic for providing control signals to the execution block and the register file in response to the digital signal processor instructions and the microcontroller instructions for executing the digital signal processor instructions and the microcontroller instructions. Preferably, the digital signal processor instructions are configured for high efficiency digital signal computations, and the microcontroller instructions are configured for code storage density. In one example, the microcontroller instructions have a 16-bit format and the digital signal processor instructions have a 32-bit format. The digital signal processor instructions may contain information indicating whether one or more related instructions follow. The related instructions may comprise load instructions.
According to a further aspect of the invention, a method is provided for executing programmed instructions. The method comprises the steps of executing digital signal processor operations in an execution block in response to digital signal processor instructions configured for efficient digital signal computation, and executing microcontroller operations in the execution block in response to microcontroller instructions configured for code storage density. An application program having a mixture of digital signal processor instructions and microcontroller instructions is characterized by high code storage density and efficient digital signal computation.
According to another aspect of the invention, a digital signal processor having a pipeline structure is provided. The digital signal processor comprises a computation block for executing computation instructions, the computation block having one or more computation stages of the pipeline structure, and a control block for fetching and decoding the computation instructions and for accessing a memory, the control block having one or more control stages of the pipeline structure. The computation stages and the control stages are positioned in the pipeline structure such that a result of the memory access is available to the computation stages without stalling the computation stages.
The computation stages and the control stages may be positioned in the pipeline structure so as to avoid stalling the computation stages when a computation instruction immediately follows a memory access instruction and requires the result of the memory access instruction. The computation stages and the control stages may be positioned in the pipeline structure such that the control block has one or more idle stages following completion of the memory access. The computation stages and the control stages may be positioned in the pipeline structure such that the computation block has one or more idle stages prior to a first computation stage.
According to another aspect of the invention, a method is provided for a digital signal computation. The method comprises the steps of executing computation operations in a computation block having one or more computation stages, executing control operations, including fetching instructions, decoding instructions and accessing a memory, in a control block having one or more control stages, wherein the computation stages and the control stages are configured in a pipeline structure, and positioning the computation stages relative to the control stages in the pipeline structure such that a result of a memory access is available to the computation stages without stalling the computation stages.
According to a further aspect of the invention, a method is provided for determining an output of a finite impulse response digital filter having L filter coefficients in response to a set of M input samples. The method comprises the steps of (a) loading a first input sample into a first location in a first register, (b) loading a second input sample into a second location in the first register, (c) loading two coefficients into a second register, (d) computing intermediate results using the contents of the first and second registers, (e) loading a new input sample into the first location in the first register, (f) computing intermediate results using the contents of the first and second registers, (g) repeating steps (b)-(f) for L iterations to provide two output samples, and (h) repeating steps (a)-(g) for M/2 iterations to provide M output samples. Step (d) may comprise a multiply accumulate operation on a first coefficient in the second register and the input sample in the first location in the first register, and a multiply accumulate operation on the first coefficient in the second register and the input sample in the second location in the first register. Step (f) may comprise a multiply accumulate operation on a second coefficient in the second register and the input sample in the first location in the first register, and a multiply accumulate operation on the second coefficient in the second register and the input sample in the second location in the first register.
It will be understood that the foregoing aspects of the invention may be practiced separately or in any combination.
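The FIR method outlined above, which produces two output samples per outer pass, can be illustrated with a small software sketch. This is only one reading of the steps, not the pseudo-code of FIG. 16: the function name, the zero padding of samples before the start of the input, the requirement that L and M be even, and the use of L/2 inner passes (each pass consumes two coefficients) are all assumptions made for illustration.

```python
def fir_two_outputs_per_pass(samples, coeffs):
    """Compute y[n] = sum_k coeffs[k] * samples[n - k], two outputs per outer pass."""
    num_coeffs, num_samples = len(coeffs), len(samples)
    if num_coeffs % 2 or num_samples % 2:
        raise ValueError("this sketch assumes an even number of coefficients and samples")

    def sample(i):
        # Assumption: samples before the start of the input are treated as zero.
        return samples[i] if 0 <= i < num_samples else 0

    outputs = []
    for n in range(0, num_samples, 2):           # M/2 outer passes
        acc0 = acc1 = 0                          # acc0 builds y[n], acc1 builds y[n+1]
        older, newer = sample(n), sample(n + 1)  # two halves of the sample register
        for k in range(0, num_coeffs, 2):        # two coefficients per inner pass
            c_first, c_second = coeffs[k], coeffs[k + 1]
            # Two multiply accumulates sharing the first coefficient.
            acc0 += c_first * older
            acc1 += c_first * newer
            # Load one new sample over the half that is no longer needed.
            newer = sample(n - k - 1)
            # Two multiply accumulates sharing the second coefficient.
            acc0 += c_second * newer
            acc1 += c_second * older
            # Reload the other half for the next inner pass.
            older = sample(n - k - 2)
        outputs += [acc0, acc1]
    return outputs
```

Each inner pass performs four multiply accumulates while loading only two coefficients and two samples, which is the reuse the method relies on.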
Brief Description of the Drawings
For a better understanding of the present invention, reference is made to the accompanying drawings, which are incorporated herein by reference and in which:
FIG. 1 is a block diagram of a computation core in accordance with an embodiment of the invention;
FIG. 2 is a block diagram of a digital signal processor incorporating the computation core of FIG. 1;
FIG. 3 is a more detailed block diagram of a portion of the computation core, showing a first embodiment of the register file;
FIG. 3A is a more detailed block diagram of a portion of the computation core, showing a second embodiment of the register file;
FIG. 4 is a block diagram of the execution units shown in FIG. 1;
FIG. 4A is a more detailed block diagram of a portion of one of the execution units shown in FIG. 4;
FIG. 5 schematically illustrates an example of the operation of the multiplier/accumulators in the execution units;
FIGS. 6A-6D schematically illustrate examples of the selection of different operands by one of the multiplier/accumulators;
FIGS. 7A-7D schematically illustrate examples of dual 16-bit arithmetic logic unit (ALU) operations which may be performed by the ALUs in the execution units;
FIG. 7E schematically illustrates an example of a quad 16-bit ALU operation which may be performed by the ALUs in the execution units;
FIG. 8 schematically illustrates the swapping of results produced by the ALUs;
FIG. 9 schematically illustrates an example of a 32-bit DSP multiply accumulate instruction format that may be used in the computation core of FIG. 1;
FIG. 10 schematically illustrates an example of a 32-bit ALU instruction format that may be used in the computation core of FIG. 1;
FIG. 11 schematically illustrates an example of a 16-bit microcontroller instruction format that may be used in the computation core of FIG. 1;
FIG. 12 schematically illustrates the operation of the pipeline in the computation core of FIG. 1;
FIG. 13 schematically illustrates the operation of a prior art pipeline;
FIG. 14 is a block diagram that illustrates an embodiment of the pipeline structure in the computation core of FIG. 1;
FIGS. 15A-15C schematically illustrate the operation of an FIR digital filter algorithm that may run efficiently on the computation core of FIG. 1; and
FIG. 16 shows pseudo-code for an example of an FIR digital filter algorithm that may run efficiently on the computation core of FIG. 1.
Detailed Description
A block diagram of an embodiment of a computation core 10 in accordance with the invention is shown in FIG. 1. A block diagram of a digital signal processor 20 incorporating computation core 10 is shown in FIG. 2. Preferably, digital signal processor 20 is implemented as a monolithic integrated circuit which incorporates computation core 10.
Computation core 10 includes a computation block 24 and an addressing block 26 coupled through operand buses 30 and result buses 32 to a memory interface 34. Address buses 40 and 42 are coupled between addressing block 26 and memory interface 34. Computation core 10 further includes an instruction sequencer 50 coupled by an instruction address bus 52 and an instruction bus 54 to memory interface 34. Memory interface 34 is connected by memory buses 60 and 62 to a memory 64 (FIG. 2), including memory banks 70, 72, 74 and 76, located external to computation core 10.
As shown in FIG. 1, computation block 24 includes a register file 80 and execution units 82 and 84, each of which is connected to operand buses 30 and result buses 32. Execution unit 82 (execution unit 0) includes an arithmetic logic unit (ALU) 90, a multiplier 92, an accumulator 94, and a shifter 96. Execution unit 84 (execution unit 1) includes an ALU 100, a multiplier 102, and an accumulator 104. The structure and operation of computation block 24 are described in detail below.
The addressing block 26 includes an address register file 120 and data address generators 124. In a preferred embodiment, address register file 120 has a capacity of 8 address values. The address register file 120 may be used for microcontroller programs that require simple addressing, and may access different word widths (8-bit bytes, 16-bit half words, and 32-bit words). The addressing block 26 may include four data address generators (DAGs) 124 for generating address sequences or patterns. The addresses generated by addressing block 26 are supplied through address buses 40 and 42, memory interface 34 and memory buses 60 and 62 to memory 64 (FIG. 2).
Instruction sequencer 50 includes a loop buffer 130, an instruction decoder 132 and sequencer/control logic 134. Instructions are received from memory 64 through one of the memory buses 60 or 62 and are delivered to the instruction sequencer 50 via instruction bus 54. The instructions are temporarily stored in loop buffer 130. The loop buffer 130 is used for implementing repetitive code sequences with no overhead. The instructions are decoded in the instruction decoder 132 and are interpreted by the sequencer/control logic 134 to control operations by the rest of the computation core.
The integration of computation core 10 into digital signal processor 20 is shown in FIG. 2. Core 10 is connected to the other elements of the digital signal processor 20 through memory buses 60 and 62. The digital signal processor 20 may further include a memory bus 150, which is not connected to computation core 10 and an industry standard bus 152, also not connected to computation core 10. Standard bus 152 may, for example, be a Peripheral Components Interconnect (PCI) bus and may be connected to memory buses 60, 62 and 150 through a peripheral bus bridge 154. As shown, memory buses 60, 62 and 150 are connected to memory banks 70, 72, 74 and 76, peripheral bus bridge 154, a DMA controller 160 and an external memory bus controller 162. The external memory bus controller 162 permits the digital signal processor 20 to be connected to an external memory via an external memory bus 164. The standard bus 152 may be connected to a custom peripheral interface 170, a serial port 172, a microcontroller host port 174, an FPGA (field programmable gate array) based peripheral 176, a custom algorithm accelerator 178 and a purchased peripheral interface 180. It will be understood that different elements may be added to or removed from the digital signal processor 20 for different applications.
A block diagram of a first embodiment of register file 80, execution units 82 and 84 and memory 64, and the interconnection between these elements, is shown in FIG. 3. In the embodiment of FIG. 3, register file 80 has eight registers and is partitioned into register file banks 200 and 202, each having four registers of 32 bits each. Thus, register file bank 200 contains registers R0-R3, and register file bank 202 contains registers R4-R7. This arrangement results in low power because each four entry register file bank 200, 202 requires less energy per access than a single eight entry register file. Each four entry register file bank 200, 202 requires two read ports and two write ports, while an eight entry register file requires four read ports and four write ports.
Register file 80 is connected to execution units 82 and 84 and to memory 64 by operand buses 30 and result buses 32. Operand buses 30 include operand bus 210, operand bus 212, operand bus 214 and operand bus 216. Operand buses 210 and 212 are connected between register file banks 200 and 202 and memory 64 for writing results of computations to memory. In another embodiment, a single operand bus may be used for writing data from register file 80 to memory 64. Operand buses 214 and 216 are connected between register file banks 200 and 202 and execution units 82 and 84 for supplying operands to execution units 82 and 84. Result buses 32 include result bus 220, result bus 222, result bus 224 and result bus 226. Result buses 220 and 222 are connected between memory 64 and register file banks 200 and 202 for reading operands from memory 64. Result buses 224 and 226 are connected between execution units 82 and 84 and register file banks 200 and 202 for writing results of computations in register file 80. In a preferred embodiment, each of operand buses 210, 212, 214 and 216 and each of result buses 220, 222, 224 and 226 is 32 bits wide. As described above, memory 64 is external to computation core 10. Thus, the connections to memory 64 are via memory interface 34 and memory buses 60 and 62, as described above in connection with FIGS. 1 and 2.
A block diagram of a second embodiment of register file 80, execution units 82 and 84 and memory 64, and the interconnection between these elements, is shown in FIG. 3A. Like elements in FIGS. 3 and 3A have the same reference numerals. In the embodiment of FIG. 3A, register file 80 has a single register file bank 240 having eight registers, R0-R7, of 32 bits each. Register file bank 240 has four read ports and four write ports.
A block diagram of execution units 82 and 84 is shown in FIG. 4. A portion of execution unit 82 is shown in more detail in FIG. 4A. Execution unit 82 includes a multiplier array 250, an ALU 252, an accumulator 254 and a barrel shifter 256. Execution unit 84 includes a multiplier array 260, an ALU 262 and an accumulator 264. Each multiplier array 250, 260 receives two 16-bit operands and provides two 32-bit outputs to the respective ALUs 252, 262. ALUs 252 and 262 may also receive two 32-bit inputs from operand buses 214 and 216. In a preferred embodiment, ALUs 252 and 262 are 40-bit ALUs. The output of ALU 252 is connected to accumulator 254 and is connected through a result swap mux (multiplexer) 280 to one input of an output select mux 282. The output of accumulator 254 is connected to a second input of output select mux 282 and is connected to an input of ALU 252. Similarly, the output of ALU 262 is connected to accumulator 264 and is connected through a result swap mux 284 to an output select mux 286. The output of accumulator 264 is connected to a second input of output select mux 286 and to an input of ALU 262. The output of output select mux 282 is connected to result bus 226, and the output of output select mux 286 is connected to result bus 224.
In multiply operations, the multiplier arrays 250 and 260 and the ALUs 252 and 262 are utilized. In multiply accumulate (MAC) operations, multiplier arrays 250 and 260, ALUs 252 and 262 and accumulators 254 and 264 are utilized. In add/subtract operations, ALUs 252 and 262 are utilized. The appropriate outputs are selected by output select muxes 282 and 286 and are supplied on result buses 226 and 224 to register file 80. The operations of the execution units 82 and 84 are described in more detail below.
It will be understood that FIG. 4 shows only the data paths in the execution units 82 and 84. Each element of execution units 82 and 84 receives control signals from the sequencer/control logic 134 (FIG. 1) for controlling operations in accordance with instructions being executed.
Each of the operand buses 214 and 216 is 32 bits wide and carries two 16-bit operands, designated as a high operand and a low operand. The computation block 24 is preferably provided with an arrangement of data selectors which permits the multiplier in each of execution units 82 and 84 to select the high or low operand from each of the operand buses 214 and 216. As shown in FIG. 4, a mux (data selector) 300 selects the high operand or the low operand from operand bus 214 for input to multiplier array 250, and a mux 302 selects the high operand or the low operand from operand bus 216 for input to multiplier array 250. Similarly, a mux 310 selects the high operand or the low operand from operand bus 214 for input to multiplier array 260, and a mux 312 selects the high operand or the low operand from operand bus 216 for input to multiplier array 260. The select inputs to muxes 300, 302, 310 and 312 are controlled in response to information contained in instructions as described below. This arrangement for selecting operands provides a high degree of flexibility in performing digital signal computations.
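The data selector arrangement described above can be modeled in software as a choice between the two 16-bit halves of a 32-bit bus value. The following is a minimal sketch for illustration; the function names and the use of a boolean for each select input are assumptions, not part of the described hardware.

```python
MASK16 = 0xFFFF


def select_operand(bus_value, select_high):
    """Model one data selector: return the high or low 16-bit operand of a 32-bit bus value."""
    if select_high:
        return (bus_value >> 16) & MASK16
    return bus_value & MASK16


def multiplier_inputs(bus_214, bus_216, select_high_214, select_high_216):
    """Model a pair of selectors (for example muxes 300 and 302) feeding one multiplier array."""
    return (
        select_operand(bus_214, select_high_214),
        select_operand(bus_216, select_high_216),
    )
```

With two selectors per multiplier array, each multiplier can be fed any of the four high/low combinations of the two buses, independently of the other multiplier.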
A schematic representation of a dual multiply accumulate operation by execution units 82 and 84 is shown in FIG. 5. Like elements in FIGS. 4 and 5 have the same reference numerals. A 32-bit data element 340 represents the operands supplied from register file 80 on operand bus 214, and a 32-bit data element 342 represents the operands supplied from register file 80 on operand bus 216. A 40-bit data element 344 represents the contents of accumulator 254, and a 40-bit data element 346 represents the contents of accumulator 264. Multiplier array 250 receives the low operands from data elements 340 and 342 and supplies an output to ALU 252. ALU 252 adds the output of multiplier array 250 and data element 344 and places the result in accumulator 254 as a new data element 344. Similarly, multiplier array 260 receives the high operands from data elements 340 and 342 and supplies an output to ALU 262. ALU 262 adds the output of multiplier array 260 and data element 346 from accumulator 264 and places the result in accumulator 264 as a new data element 346.
In the example of FIG. 5, muxes 300 and 302 (FIG. 4) select the low operands from operand buses 214 and 216 and supply the low operands to multiplier array 250. Muxes 310 and 312 select the high operands from operand buses 214 and 216 and supply the high operands to multiplier array 260.
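The dual multiply accumulate of FIG. 5 can be sketched as follows. The sketch assumes signed integer operands and 40-bit accumulators that wrap on overflow; fractional formats, rounding and saturation, which the instruction set may select, are not modeled. All function and parameter names are illustrative.

```python
MASK16 = 0xFFFF
MASK40 = (1 << 40) - 1


def to_signed16(value):
    """Interpret a 16-bit pattern as a two's complement integer (an assumption of this sketch)."""
    value &= MASK16
    return value - 0x10000 if value & 0x8000 else value


def dual_mac(element_340, element_342, accumulator_254, accumulator_264):
    """Return the new (accumulator_254, accumulator_264) contents after one dual MAC."""
    low_340, high_340 = element_340 & MASK16, (element_340 >> 16) & MASK16
    low_342, high_342 = element_342 & MASK16, (element_342 >> 16) & MASK16
    # Execution unit 82: multiplier array 250 takes the low operands.
    product_82 = to_signed16(low_340) * to_signed16(low_342)
    # Execution unit 84: multiplier array 260 takes the high operands.
    product_84 = to_signed16(high_340) * to_signed16(high_342)
    # Each ALU adds its product to its accumulator; results wrap to 40 bits here.
    return (
        (accumulator_254 + product_82) & MASK40,
        (accumulator_264 + product_84) & MASK40,
    )
```

For example, with data elements 0x00030002 and 0x00050004 and accumulators holding 10 and 20, the low product 2 x 4 and the high product 3 x 5 give new accumulator values of 18 and 35.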
Selection of different operands for computation by execution unit 82 is illustrated in the schematic representations of FIGS. 6A-6D. Like elements in FIGS. 4, 5 and 6A-6D have the same reference numerals. As shown in FIG. 6A, the low operand of data element 340 and the low operand of data element 342 are supplied to multiplier array 250. As shown in FIG. 6B, the high operand of data element 340 and the low operand of data element 342 are supplied to multiplier array 250. As shown in FIG. 6C, the low operand of data element 340 and the high operand of data element 342 are supplied to multiplier array 250. As shown in FIG. 6D, the high operand of data element 340 and the high operand of data element 342 are supplied to multiplier array 250. In each case, the data element 340 appears on operand bus 214 (FIG. 4), and the data element 342 appears on operand bus 216. The selection of operands for multiplier array 250 is made by muxes 300 and 302, as shown in FIG. 4. In the same manner, muxes 310 and 312 perform operand selection for multiplier array 260. The muxes 300, 302, 310 and 312 are controlled by select signals derived from instructions being executed, as described below.
The operand selection technique is described above in connection with dual multiply accumulate (MAC) units. However, since this technique relates to the data movement and selection aspects of computation, it is generally applicable to data selection for any execution unit that performs any arbitrary arithmetic operation. In addition, although the description relates to selection of one of two 16-bit operands, the operand selection technique can be implemented with operands of any width and with two or more operands. When using the operand selection technique, the programmer selects two pairs of adjacent 16-bit data elements that reside in register file 80. When these two pairs of 16-bit data elements are selected and transferred to the execution units 82 and 84 via operand buses 214 and 216, the programmer selects a high or low 16-bit operand from a 32-bit data element to serve as one input to one of the MACs. The other input to the same MAC is a high or low 16-bit operand selected from the other operand bus.
The execution units 82 and 84 also execute instructions which specify ALU operations, i.e., operations which involve addition or subtraction and which do not require the multiplier array or the accumulator. The ALUs 252 and 262 may be configured for performing various ALU operations. In most cases, only one of the ALUs 252 and 262 is active in performing ALU operations. An exception is shown in FIG. 7E and is described below. ALU operations are described in connection with ALU 252 and execution unit 82. It will be understood that the same ALU operations can be performed by ALU 262 in execution unit 84. In one configuration, ALU 252 performs a 32-bit add or subtract and outputs a 32-bit result through result swap mux 280 and output select mux 282 to result bus 226.
The ALU 252 may be configured for performing two 16-bit addition or subtraction operations, as illustrated in FIGS. 7A-7D. In particular, 32-bit ALU 252 is configured to function as two 16-bit ALUs 360 and 362 (FIG. 4A). A 32-bit ALU may be configured as two independent 16-bit ALUs by blocking the carry from bit 15 into bit 16. As shown in FIG. 7A, ALU 360 adds the high operands of data elements 340 and 342 and places the 16-bit result in a high result portion of a data element 364. ALU 362 adds the low operands of data elements 340 and 342 and places the result in a low result portion of data element 364. The 32-bit data element 364 is supplied on result bus 226 to register file 80.
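The effect of blocking the carry from bit 15 into bit 16 can be seen in a small model of a 32-bit adder. This is an illustrative sketch only; the function name and the boolean flag are assumptions.

```python
MASK16 = 0xFFFF


def add32(a, b, block_carry_15_to_16):
    """Add two 32-bit values, optionally blocking the carry out of bit 15."""
    low_sum = (a & MASK16) + (b & MASK16)
    # With the carry blocked, the two halves behave as independent 16-bit adders.
    carry = 0 if block_carry_15_to_16 else (low_sum >> 16)
    high_sum = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    return ((high_sum & MASK16) << 16) | (low_sum & MASK16)
```

Adding 0x0001FFFF and 0x00010001 gives 0x00030000 as a single 32-bit sum, but 0x00020000 when the carry is blocked, because the overflow of the low halves no longer reaches the high halves.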
FIGS. 7A, 7B, 7C and 7D illustrate the fact that 16-bit ALUs 360 and 362 are separately programmable in response to control 1 and control 0 signals (FIG. 4A), and may perform the same or different operations. FIG. 7A illustrates the case where ALU 360 and ALU 362 both perform add operations. FIG. 7B illustrates the case where ALU 360 and ALU 362 both perform subtract operations. FIG. 7C illustrates the case where ALU 360 performs an add operation and ALU 362 performs a subtract operation. FIG. 7D illustrates the case where ALU 360 performs a subtract operation and ALU 362 performs an add operation. The control 1 and control 0 signals are supplied from instruction decoder 132 (FIG. 1) in response to decoding of an instruction being executed.
As described above, ALU operations typically utilize only one of the execution units 82 and 84. An exception is described with reference to FIG. 7E. In this configuration, the sum and the difference of each pair of 16-bit operands is generated. The 16-bit ALU 360 subtracts the high operands of data elements 340 and 342 and places the result in the high result portion of data element 364. The 16-bit ALU 362 adds the low operands of data elements 340 and 342 and places the result in the low result portion of data element 364. This configuration further utilizes 32-bit ALU 262 in execution unit 84 configured as 16-bit ALUs 370 and 372. The 16-bit ALU 370 adds the high operands of data elements 340 and 342 and places the result in a high result portion of a data element 374. The 16-bit ALU 372 subtracts the low operands of data elements 340 and 342 and places the result in a low result portion of data element 374. Data element 374 is supplied on result bus 224 to register file 80. In this configuration, execution units 82 and 84 simultaneously perform four 16-bit ALU operations.
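The dual and quad 16-bit operations described above can be sketched with one small function per configuration. The sketch assumes that each control input simply selects add or subtract, that the first control input drives the ALU producing the high result, and that results wrap to 16 bits; these are illustrative assumptions, as are all names.

```python
MASK16 = 0xFFFF


def alu16(a, b, subtract):
    """One 16-bit ALU: add or subtract, wrapping to 16 bits."""
    return (a - b if subtract else a + b) & MASK16


def dual_alu(element_340, element_342, subtract_high, subtract_low):
    """Two separately controlled 16-bit operations on the high and low operands."""
    high = alu16((element_340 >> 16) & MASK16, (element_342 >> 16) & MASK16, subtract_high)
    low = alu16(element_340 & MASK16, element_342 & MASK16, subtract_low)
    return (high << 16) | low


def quad_alu(element_340, element_342):
    """Quad configuration of FIG. 7E: sum and difference of each operand pair."""
    # ALU 360 subtracts the high operands; ALU 362 adds the low operands.
    element_364 = dual_alu(element_340, element_342, True, False)
    # ALU 370 adds the high operands; ALU 372 subtracts the low operands.
    element_374 = dual_alu(element_340, element_342, False, True)
    return element_364, element_374
```

For data elements 0x00050003 and 0x00020001, the quad configuration yields 0x00030004 (high difference, low sum) and 0x00070002 (high sum, low difference).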
FIGS. 7A-7D illustrate a configuration where 16-bit ALU 360 and 16-bit ALU 362 are separately programmable and the operations performed by ALUs 360 and 362 may be the same or different. As described below in connection with FIG. 10, an ALU instruction includes operation fields which specify the individual operations to be performed by ALUs 360 and 362. This individual control feature is generally applicable to any execution units that perform two or more operations simultaneously. Thus, for example, the multiplier accumulators in execution units 82 and 84 are individually controllable and may perform the same or different operations. As described below in connection with FIG. 9, a multiplier accumulator instruction includes operation fields which individually specify the operations to be performed by execution units 82 and 84. The individual control feature can be implemented with execution units of any type or width, and with two or more execution units, or with a single execution unit having two or more computation devices.
A further feature of execution units 82 and 84 is described with reference to FIGS. 4A and 8. As shown, the results generated by 16-bit ALUs 360 and 362 may be reversed, or swapped, with respect to their placement in 32-bit data element 364. Specifically, the output of ALU 360 is supplied to the low result portion of data element 364, and the output of ALU 362 is supplied to the high result portion of data element 364. This reversed or swapped configuration is contrasted with the configuration of FIGS. 7A-7D, where the output of ALU 360 is supplied to the high result portion of data element 364 and the output of ALU 362 is supplied to the low result portion of data element 364. The reversal or swapping of the outputs of ALUs 360 and 362 is performed by result swap mux 280 (FIG. 4) in response to information contained in an instruction. The result swapping operation at the output of ALUs 360 and 362 is useful, for example, to achieve conjugation in complex arithmetic.
As shown in FIG. 4A, result swap mux 280 may include a mux 380, which is controlled by a swap signal, and a mux 382, which is controlled by an inverted swap signal. Each of the muxes 380 and 382 receives the 16-bit results from ALUs 360 and 362. When the swap signal is not asserted, the output of ALU 360 is supplied to the high result portion of result bus 226, and the output of ALU 362 is supplied to the low result portion of result bus 226. When the swap signal is asserted, the output of ALU 360 is supplied to the low result portion of result bus 226, and the output of ALU 362 is supplied to the high result portion of result bus 226, thereby swapping the outputs of ALUs 360 and 362. In FIG. 4A, output select mux 282 (FIG. 4) is omitted for simplicity of illustration.
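The behavior of the result swap mux can be summarized in a few lines. This is a behavioral sketch only; it models the combined effect of the two muxes rather than their individual wiring, and the function name is an assumption.

```python
MASK16 = 0xFFFF


def result_swap_mux(alu_360_out, alu_362_out, swap):
    """Place two 16-bit ALU results on a 32-bit result bus, optionally swapped."""
    if swap:
        high, low = alu_362_out, alu_360_out
    else:
        high, low = alu_360_out, alu_362_out
    return ((high & MASK16) << 16) | (low & MASK16)
```

With ALU outputs 0x1111 and 0x2222, the result bus carries 0x11112222 when the swap signal is not asserted and 0x22221111 when it is asserted.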
The result swapping technique is described above in connection with swapping of ALU outputs. However, since this technique relates to the data movement aspects of computation, it is generally applicable to result swapping for any execution unit that produces two or more results. As described below in connection with FIG. 10, an ALU instruction includes a field which specifies whether or not the results of the ALU operations are to be swapped. The result swapping technique can be implemented with results of any width and with two or more results.
An example of a DSP type MAC instruction format for controlling execution units 82 and 84 to perform multiply accumulate operations is shown in FIG. 9. A multiplier accumulator instruction 400 has a 32-bit format, with the fields of the instructions as shown in FIG. 9. Source fields, srcO and srcl, each having three bits, identify the registers in register file 80 which are to provide the operands for the computation on operand buses 214 and 216 (FIG. 4). A three bit destination field, dst, identifies the register in register file 80 where the result of the computation is to be stored. Operation fields, opO and opl, each having two bits, indicate the operations to be performed by execution units 82 and 84, respectively. The operations include multiply, multiply-add, multiply-subtract and no operation. The multiply-add and multiply-subtract operations are MAC operations. A P field indicates whether the result is to be written to a single register or written to a register pair. Two w fields wl, and wO, indicate whether the result is to be accumulated only or accumulated and written to a register. The wl field applies to execution unit 82, and the wO field applies to execution unit 84. An hOO field indicates whether to select the high operand or the low operand of source 0 (srcO) for execution unit 82. An hlO field indicates whether to select the high operand or the low operand of source 1 (srcl) for execution unit 82. An hOl field indicates whether to select the high operand or the low operand of source 0 for execution unit 84. An hi 1 field indicates whether to select the high operand or the low operand of source 1 for execution unit 84. Thus, the hOO and hlO fields control muxes 300 and 302, respectively, at the inputs to execution unit 82, and the hOl and hi 1 fields control muxes 310 and 312, respectively, at the inputs to execution unit 84. An MM field indicates whether or not execution unit 84 is in mixed mode (signed/unsigned). 
An mmod field indicates fraction or integer operation, signed or unsigned operation, round or truncate operation and scaled or unscaled operation. An M field indicates whether or not two load/store instructions follow the instruction.
An example of a DSP type ALU instruction format for controlling execution units 82 and 84 to perform ALU operations is shown in FIG. 10. An ALU instruction 450 has a 32-bit format. As in the case of the multiply accumulate instruction, the M field indicates whether or not two load/store instructions follow the instruction. An operation code field, aopcde, is used in conjunction with a secondary op code field, aop, to specify a particular arithmetic operation. Examples include single 16-bit ALU operations, single 32-bit ALU operations, dual 16-bit ALU operations and quad 16-bit ALU operations, as well as other arithmetic operations known to those skilled in the art. Source fields, src0 and src1, each having 3 bits, specify the source registers in register file 80 containing 32-bit data elements for the computation.
Destination fields, dst0 and dst1, each having 3 bits, specify the destination registers in register file 80 for storing the results of the computation. An HL field indicates whether the result of a single ALU operation is to be deposited in the high half or the low half of the destination register. An x field indicates whether or not two 16-bit results are to be swapped as they are deposited in the destination register. The value contained in the x field controls the operation of result swap mux 280 (FIG. 4A) as described above. An s field determines whether saturation is active or inactive. In the case of a single 16-bit add or subtract, the aop field indicates the two operands that are to be added or subtracted, i.e., low and low; low and high; high and low; or high and high. The HL field indicates whether the 16-bit result is to be deposited in the high or low half of the destination register. In the case of a dual 16-bit add or subtract, the aop field indicates the two operations to be performed by the two 16-bit ALUs, i.e., add/add; add/subtract; subtract/add; or subtract/subtract. In the dual 16-bit add or subtract operations, the aop field controls the individual operations performed by ALUs 360 and 362 (see FIGS. 7A-7D). In the case of quad 16-bit add or subtract operations, the aop field controls the operations performed by 16-bit ALUs 360, 362, 370 and 372 (FIG. 7E). The possible operations are add/add for one execution unit and subtract/subtract for the other execution unit, or add/subtract for one execution unit and subtract/add for the other execution unit, to avoid redundant calculations. The aopcde field in instruction 450 may also specify a 32-bit add or subtract operation.
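The dual 16-bit add or subtract with optional result swapping described above may be illustrated by the following minimal Python sketch. It is not the patented circuit: the function name `dual_alu16`, the use of `'+'` and `'-'` strings in place of the aop encoding, the pairing of high halves with high halves and low halves with low halves, and the wrap-around arithmetic (saturation under the s field is not modelled) are all assumptions made for illustration.

```python
MASK16 = 0xFFFF


def dual_alu16(src0, src1, op_hi, op_lo, swap=False):
    """Emulate a dual 16-bit add/subtract on two 32-bit register values.

    op_hi and op_lo are '+' or '-' (hypothetical stand-ins for the aop field).
    swap plays the role of the x field: when True, the two 16-bit results
    exchange halves as they are deposited in the destination register.
    Results wrap modulo 2**16; saturation is not modelled.
    """
    def alu(a, b, op):
        return (a + b if op == '+' else a - b) & MASK16

    hi = alu((src0 >> 16) & MASK16, (src1 >> 16) & MASK16, op_hi)
    lo = alu(src0 & MASK16, src1 & MASK16, op_lo)
    if swap:
        hi, lo = lo, hi  # result swap before writing the destination
    return (hi << 16) | lo
```

For example, `dual_alu16(0x00050003, 0x00010002, '+', '-')` adds the high halves (5 + 1) and subtracts the low halves (3 - 2); with `swap=True` the same two results are written to the opposite halves of the destination.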
The instruction formats shown in FIGS. 9 and 10 and described above are DSP instructions. These instructions are characterized by a high degree of flexibility and include optional features to permit efficient digital signal processor computations.
An example of a microcontroller type instruction format for controlling execution units 82 and 84 to perform arithmetic operations is shown in FIG. 11. An instruction 480 has a length of 16 bits and contains only three fields: a 4-bit operation code field, opc, a 3-bit source field, src, and a 3-bit destination field, dst. The input operands are taken from the registers in register file 80 specified by the src and dst fields. The result of the computation is placed in the register specified by the dst field, thereby overwriting one of the operands. The operation code field, opc, may specify add, subtract, multiply, as well as other arithmetic operations known to those skilled in the art.
It may be observed that instruction 480 is relatively simple and has only three fields that may be specified by the programmer. However, because instruction 480 has a length of 16 bits, it occupies only half of the memory space that is occupied by the more complex DSP instructions described above. As described above, code density is an important factor in microcontroller applications. A typical microcontroller application may have a relatively large number of instructions requiring relatively simple computations and data handling. Because the number of instructions in a microcontroller application may be large, code density is an important factor in minimizing memory requirements. By contrast, DSP applications typically include a relatively small number of instructions which may be executed repetitively in performing DSP computations. Thus, code density is less important than efficient execution in achieving high performance in DSP applications.
By providing instruction formats of the type described above in connection with FIGS. 9-11, microcontroller and DSP functions may be combined efficiently in a single computation core. A combined application typically includes a relatively large number of 16-bit microcontroller instructions and a relatively small number of 32-bit DSP instructions, thereby achieving a high degree of code density. The relatively small number of DSP instructions can be optimized for the highest performance in executing DSP computations.
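The code density argument can be made concrete with a small calculation. The following Python sketch is illustrative only; the function name `code_size_bytes` and the instruction counts used in the example are hypothetical and do not come from the specification.

```python
def code_size_bytes(n_microcontroller, n_dsp):
    """Program size when microcontroller instructions take 16 bits (2 bytes)
    and DSP instructions take 32 bits (4 bytes)."""
    return 2 * n_microcontroller + 4 * n_dsp


# Hypothetical mix: 900 microcontroller instructions and 100 DSP instructions.
mixed = code_size_bytes(900, 100)    # 16-bit and 32-bit formats combined
all_long = 4 * (900 + 100)           # every instruction encoded in 32 bits
```

With these assumed counts the mixed encoding occupies 2200 bytes, compared with 4000 bytes if every instruction used the 32-bit format.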
The computation core 10 preferably has a pipeline architecture, as illustrated in FIGS. 12 and 14. In the embodiment of FIGS. 12 and 14, the pipeline has eight stages. In a pipeline architecture, each stage performs a specified function of instruction execution, permitting multiple instructions to be executed simultaneously, with each instruction having a different phase of execution. FIG. 12 is a pipeline timing diagram wherein a horizontal row of blocks represents the functions performed by the different stages of the pipeline in executing a single instruction. Thus, row 500 represents execution of a first instruction, and row 502 represents execution of a second instruction. Vertically aligned blocks represent functions that are performed simultaneously by different stages in the pipeline.
In FIG. 12, stages 0 and 1 perform instruction fetch (IF) from an instruction cache 510 (FIG. 14). Stage 2 performs instruction decoding (ID) in instruction decoder 132. Stage 3 performs data address generation (DAG) in DAG 124. Stages 4 and 5 perform data memory access (M1 and M2) in memory 64. The instruction fetch, instruction decode, data address generation and memory access functions are performed by the control section of computation core 10, including instruction sequencer 50 and addressing block 26 (FIG. 1). Stages 4-7 include operations performed by computation block 24. Stage 4 performs register file read (RFR) from register file 80. Stages 5 and 6 perform multiply accumulate operations (MAC1 and MAC2) in execution units 82 and 84. In particular, the MAC1 operation of stage 5 is executed by multiplier arrays 250 and 260, and the MAC2 operation of stage 6 is executed by ALUs 252 and 262 and accumulators 254 and 264. Arithmetic logic and shift operations (EX) of stage 6 are executed by ALUs 252 and 262 or barrel shifter 270. The stage 7 operation is a register file write (RFW) from execution units 82 and 84 to register file 80. In the pipeline structure shown in FIG. 14, pipeline stages are separated by latches 508 controlled by a system clock, as known in the art.
The pipeline shown in FIGS. 12 and 14 and described above is optimized for achieving high performance when executing DSP code. A feature of the pipeline is that memory access operations (M1 and M2), such as loads and stores, occur early in the pipeline relative to the computation operations (EX, MAC1 and MAC2), thus achieving early memory access. In FIG. 12 this is illustrated by the arrow from the end of the second memory access stage (M2) in row 500 to the beginning of the first computation stage (MAC1) in row 502. The arrow represents a register file bypass operation wherein data loaded from memory is supplied directly to execution units 82 and 84, and register file 80 is bypassed. In DSP code, an instruction sequence of a load instruction followed by a multiply accumulate (MAC) is very common. The pipeline organization shown in FIGS. 12 and 14 does not produce any stalls in executing this sequence. It may be noted that in order to organize the pipeline in this manner, the memory access operations (DAG, M1, and M2) occur relatively early in the pipeline and result in two idle pipeline stages, stages 6 and 7, in the control section of the computation core. Also, the computation operations (MAC1 and MAC2) occur relatively late in the pipeline and result in one idle stage (DAG), stage 3, in the computation block 24 of the computation core.
A timing diagram for a conventional pipeline is illustrated in FIG. 13. As shown, memory access operations (DAG, M1, and M2) occur late in the pipeline relative to the computation operations (MAC1 and MAC2). In particular, memory access operations (M1 and M2) and computation operations (MAC1 and MAC2) both occur in stages 4 and 5. As a result, a one cycle stall is required between a load instruction and a computation instruction that immediately follows the load instruction. The stall may have a significant impact on performance where the sequence of instructions is contained in a loop that is executed multiple times. By contrast, the pipeline structure shown in FIGS. 12 and 14 does not require a stall between a load instruction and a computation instruction.
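The difference between the two pipelines may be checked with a simple stage-timing model. The Python sketch below is an illustration under stated assumptions, not a description of the hardware: it assumes one instruction issued per cycle, a load issued at cycle 0 whose data becomes usable (through the register file bypass) in the cycle after its last memory access stage, and a dependent computation instruction issued at cycle 1. The function name `load_use_stalls` is hypothetical.

```python
def load_use_stalls(last_mem_stage, first_compute_stage):
    """Stall cycles for a computation instruction that immediately follows
    a load and consumes the loaded data.

    last_mem_stage:      pipeline stage index of the load's final memory access (M2)
    first_compute_stage: pipeline stage index of the first computation stage (MAC1)
    """
    data_ready_cycle = last_mem_stage + 1     # load issued at cycle 0
    wanted_cycle = 1 + first_compute_stage    # consumer issued at cycle 1
    return max(0, data_ready_cycle - wanted_cycle)


early_access = load_use_stalls(last_mem_stage=5, first_compute_stage=5)  # FIG. 12
conventional = load_use_stalls(last_mem_stage=5, first_compute_stage=4)  # FIG. 13
```

Under these assumptions the early memory access pipeline, with M2 and MAC1 both in stage 5, yields zero stall cycles, while the conventional arrangement, with M2 in stage 5 and MAC1 in stage 4, yields the one cycle stall noted above.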
The early memory access pipeline structure shown in FIG. 12 has advantages in comparison with the prior art pipeline structure shown in FIG. 13. Load-to-use latencies in processors with execution units that have multiple pipeline stages are eliminated. Normally, processors with this type of execution unit suffer from load-to-use latencies. Elimination of load-to-use latencies results in simpler software that does not require loop unrolling or software pipelining, which are software techniques used to improve performance in processors with load-to-use latencies. Even when these techniques are applied, the performance of a conventional processor may be lower than that of the pipeline structure shown in FIGS. 12 and 14 and described above.
As noted above, the computation core structure described herein facilitates efficient digital signal computations. One example of a DSP algorithm that may be implemented efficiently on computation core 10 is a finite impulse response (FIR) digital filter. An FIR filter may be defined mathematically as
$$z(n) = \sum_{k=0}^{L-1} c(k)\, x(n+k) \qquad (1)$$
where x(n) are samples of an input signal, c(k) are L filter coefficients and z(n) are output signal samples. Each output z(n) is obtained by computing the vector product of L samples of the input signal x(n) times L filter coefficients c(k) and summing the products. All signals and coefficients are 16-bit data values in this example.
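For reference, Equation (1) may be evaluated directly as in the following Python sketch. The function name `fir_direct` and the parameter `n_out` (the number of outputs to produce) are introduced here for illustration; plain Python integers are used, so the 16-bit data width of the example is not modelled.

```python
def fir_direct(x, c, n_out):
    """Evaluate z(n) = sum over k of c(k) * x(n + k) for n = 0 .. n_out - 1.

    x must contain at least n_out + len(c) - 1 samples.
    """
    L = len(c)
    return [sum(c[k] * x[n + k] for k in range(L)) for n in range(n_out)]
```

For example, with x = [1, 2, 3, 4, 5] and c = [1, 10, 100], the first output is 1·1 + 10·2 + 100·3 = 321.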
The dual multiply accumulate operations shown in FIGS. 5 and 6A-6D and described above may be utilized to perform FIR filter computations. In particular, execution units 82 and 84 may be utilized to perform two multiply accumulate operations simultaneously. In order to perform two multiply accumulate operations, a conventional implementation of an FIR filter on a DSP with dual execution units would require that a total of four data values be loaded from memory: two input values from x(n) and two filter coefficients from c(n). These data loads are achieved by loading a pair of adjacent data values and a pair of adjacent filter coefficient values. A problem with this technique is that for half of the total number of memory accesses, the pairs of data values must come from locations that are not 32-bit aligned in memory. That is, the memory must be able to deliver data elements x(0) and x(1) into a register in an aligned 32-bit access, and must also be able to deliver data elements x(1) and x(2) to a register in a misaligned 32-bit access. Note that either the data elements x(n) or the coefficients c(n) must be accessed as misaligned 32-bit element pairs, but not both. One of these signals may always be accessed as 32-bit aligned pairs, and here it is assumed that coefficients c(n) are accessed as aligned 32-bit pairs. The delivery of misaligned 32-bit element pairs in prior art systems requires two memory accesses and, therefore, is relatively inefficient.
A novel FIR filter implementation avoids misaligned 32-bit data accesses as follows. Let execution unit 82 (MAC0) compute all of the even indexed outputs and execution unit 84 (MAC1) compute all of the odd indexed outputs. For example, outputs z(0) through z(3) may be computed as follows.
$$
\begin{aligned}
z(0) &= x(0) \cdot c(0) + x(1) \cdot c(1) + x(2) \cdot c(2) + \cdots \qquad (2)\\
z(1) &= x(1) \cdot c(0) + x(2) \cdot c(1) + x(3) \cdot c(2) + \cdots \qquad (3)\\
z(2) &= x(2) \cdot c(0) + x(3) \cdot c(1) + x(4) \cdot c(2) + \cdots \qquad (4)\\
z(3) &= x(3) \cdot c(0) + x(4) \cdot c(1) + x(5) \cdot c(2) + \cdots \qquad (5)
\end{aligned}
$$
where execution unit 82 computes z(0) and z(2), and execution unit 84 computes z(1) and z(3).
Assume that data sample pair x(0) and x(1) is loaded into register R0, as shown in FIG. 15A, and coefficient pair c(0) and c(1) is loaded into register R1. Two multiply accumulates are computed using the 16-bit operand selection method as follows.
$$z(0) \mathrel{+}= x(0) \cdot c(0), \quad \text{and} \quad z(1) \mathrel{+}= x(1) \cdot c(0), \qquad (6)$$
where the symbol "+=" represents the multiply accumulate operation. The value z(0) is computed in execution unit 82, and the value z(1) is computed in execution unit 84. Both memory accesses illustrated in FIG. 15A are 32-bit aligned.
Next, rather than loading a data pair, a single data sample x(2) is loaded into the low half of register R0, as shown in FIG. 15B. It may be noted that the high half of register R0 and all of register R1 remain unchanged. Two multiply accumulates may now be computed as follows.
$$z(0) \mathrel{+}= x(1) \cdot c(1), \quad \text{and} \quad z(1) \mathrel{+}= x(2) \cdot c(1) \qquad (7)$$
Again, the value z(0) is computed in execution unit 82, and the value z(1) is computed in execution unit 84.
For the next set of two multiply accumulate computations, coefficient pair c(2) and c(3) is loaded into register R1 and a single data sample x(3) is loaded into the high half of register R0, as shown in FIG. 15C. The low half of register R0 is not changed. The two multiply accumulate computations are performed as follows.
$$z(0) \mathrel{+}= x(2) \cdot c(2), \quad \text{and} \quad z(1) \mathrel{+}= x(3) \cdot c(2), \qquad (8)$$
where the value of z(0) is computed in execution unit 82, and the value of z(1) is computed in execution unit 84.
With this technique, not only are all accesses aligned, but the execution units 82 and 84 are able to obtain all of the required input operands from only two 32-bit registers in the register file. This is the reason why this technique can be implemented in the architecture with high or low operand selection as described above. The inputs are loaded into register halves in a "ping pong" sequence. Without this ping pong sequence, the register file would be required to supply four 16-bit data elements to the execution units, rather than two 32-bit data elements (in addition to the filter coefficients), which would result in a more complex register file.
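The ping pong sequence described above may be sketched in Python as follows. This is a minimal model under stated assumptions rather than the implementation of the invention: the variables `r0_lo`, `r0_hi`, `r1_lo` and `r1_hi` stand in for the halves of registers R0 and R1, `a0` and `a1` stand in for the two accumulators, the placement of coefficients in the halves of R1 is assumed, the number of coefficients and the number of outputs are assumed to be even, and plain Python integers replace 16-bit data. The function name `fir_ping_pong` is hypothetical.

```python
def fir_ping_pong(x, c, n_out):
    """Compute FIR outputs two at a time using ping pong register halves.

    Assumes len(c) and n_out are even and len(x) >= n_out + len(c) - 1.
    """
    L = len(c)
    z = []
    for n in range(0, n_out, 2):
        a0 = a1 = 0                        # accumulators for z(n) and z(n+1)
        r0_lo = x[n]                       # low half of R0
        for k in range(0, L, 2):
            r0_hi = x[n + k + 1]           # load one sample into the high half
            r1_lo, r1_hi = c[k], c[k + 1]  # aligned coefficient pair in R1
            a0 += r0_lo * r1_lo            # z(n)   += x(n+k)   * c(k)
            a1 += r0_hi * r1_lo            # z(n+1) += x(n+k+1) * c(k)
            r0_lo = x[n + k + 2]           # replace only the low half
            a0 += r0_hi * r1_hi            # z(n)   += x(n+k+1) * c(k+1)
            a1 += r0_lo * r1_hi            # z(n+1) += x(n+k+2) * c(k+1)
        z += [a0, a1]
    return z
```

In this sketch each multiply accumulate selects one half of R0 and one half of R1, mirroring the high or low operand selection, and every sample load writes a single register half, so no misaligned pair access is needed.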
A pseudo-code representation of an algorithm for performing FIR digital filter computations as described above is shown in FIG. 16. The algorithm includes an outer loop and an inner loop. The outer loop is executed M/2 times, where M is the number of input data samples in the data set. Since two output values are computed on each pass of the outer loop, M/2 iterations are required. In the outer loop, a 16-bit data element x(0) is loaded into register RL0, the low half of register R0, and the inner loop is executed L times, where L is the number of coefficients in the FIR filter. The inner loop performs the multiply accumulate operations for values of an index variable k from 0 to L-1. In the inner loop, a 16-bit data element x(n+k+1) is loaded into register RH0, the high half of register R0. Two 16-bit coefficients c(k+1) and c(k) are loaded into register R1. The multiply accumulate value z(n+1) is computed in execution unit 84, and the result is stored in accumulator A1. The multiply accumulate value z(n) is computed in execution unit 82, and the result is stored in accumulator A0. Next, a 16-bit data element x(n+k+2) is loaded into register RL0, the low half of register R0, and the multiply accumulate values z(n+1) and z(n) are computed. As noted above, the inner loop is executed L times.
While there have been shown and described what are at present considered the preferred embodiments of the present invention, it will be obvious to those skilled in the art that various changes and modifications may be made therein without departing from the scope of the invention as defined by the appended claims.

Claims

1. A computation unit comprising: an execution unit for performing an operation on a first operand and a second operand in response to an instruction; a register file for storing operands; first and second operand buses coupled to said register file, each carrying a high operand and a low operand; a first data selector, responsive to a first operand select value contained in the instruction, for supplying the high operand or the low operand from said first operand bus to said execution unit; and a second data selector, responsive to a second operand select value contained in the instruction, for supplying the high operand or the low operand from said second operand bus to said execution unit.
2. A computation unit as defined in claim 1 wherein said execution unit comprises an arithmetic logic unit, a multiplier and an accumulator.
3. A computation unit as defined in claim 2 wherein said execution unit further comprises a shifter.
4. A computation unit as defined in claim 1 wherein said register file comprises first and second register banks, each having two read ports and two write ports.
5. A computation unit as defined in claim 1 wherein said register file comprises a single register bank having four read ports and four write ports.
6. A computation unit as defined in claim 1 further comprising first and second result buses coupled between said execution unit and said register file for transferring a result of the specified operation to the register file.
7. A computation unit as defined in claim 1 wherein each of said high and low operands is a 16-bit operand.
8. A computation unit as defined in claim 1 configured for performing digital signal processor computations.
9. A computation unit comprising: an execution unit for performing an operation on first and second operands in response to an instruction; a register file for storing operands; an operand bus coupled to said register file, said operand bus carrying a high operand and a low operand; and a data selector, responsive to an operand select value contained in the instruction, for supplying the high operand or the low operand from said operand bus to said execution unit.
10. A computation unit as defined in claim 9 wherein said execution unit comprises an arithmetic logic unit, a multiplier and an accumulator.
11. A computation unit as defined in claim 10 wherein said execution unit further comprises a shifter.
12. A computation unit as defined in claim 9 further comprising a result bus coupled between the execution unit and the register file for transferring a result of the specified operation to the register file.
13. A computation unit as defined in claim 9 wherein said register file comprises first and second register banks, each having two read ports and two write ports.
14. A computation unit as defined in claim 9 wherein said register file comprises a single register bank having four read ports and four write ports.
15. A method for performing a digital computation, comprising the steps of: storing operands for the computation in a register file; supplying operands from the register file on first and second operand buses, each carrying a high operand and a low operand; selecting the high operand or the low operand from the first operand bus in response to a first operand select value contained in an instruction and supplying a selected first operand to the execution unit; selecting the high operand or the low operand from the second operand bus in response to a second operand select value contained in the instruction and supplying a selected second operand to the execution unit; and performing an operation specified by the instruction on the operands selected from the first and second operand buses.
16. A digital signal processor computation unit comprising: first and second execution units for performing operations in response to an instruction and for producing first and second results; a result register for storing results of the operations, said result register having first and second locations; and result swapping logic, coupled between said first and second execution units and said result register, for swapping said first and second results between the first and second locations in said result register in response to result swapping information contained in the instruction.
17. A digital signal processor computation unit as defined in claim 16 wherein said first and second execution units comprise first and second arithmetic logic units for performing add and subtract operations.
18. A digital signal processor computation unit as defined in claim 17 wherein said first and second arithmetic logic units comprise 16-bit arithmetic logic units.
19. A digital signal processor computation unit as defined in claim 17 wherein the first and second locations in said result register comprise high and low halves of said result register.
20. A digital signal processor computation unit as defined in claim 16 wherein said first and second execution units are separately controllable and perform the same or different operations in response to operation code information contained in the instruction.
21. A digital signal processor computation unit as defined in claim 18 wherein said 16-bit arithmetic logic units are configurable as a 32-bit arithmetic logic unit.
22. A digital signal processor computation unit as defined in claim 16 wherein said result swapping logic comprises a first data selector for placing one of said first and second results in the first location in said result register in response to the result swapping information contained in the instruction and a second data selector for placing the other of said first and second results in the second location in response to the result swapping information.
23. A digital signal processor computation unit as defined in claim 16 wherein said result register comprises a register in a register file and wherein said register file is coupled to said result swapping logic by a result bus.
24. A method for performing digital signal computations, comprising the steps of: performing operations in first and second execution units in response to an instruction and producing first and second results; storing results of the operations in a result register having first and second locations; and swapping said first and second results with respect to the first and second locations in said result register, in response to result swapping control information contained in the instruction.
25. A digital signal processor computation unit comprising: first and second execution units for performing operations in response to an instruction and for producing first and second results; a result register for storing results of the operations, said result register having first and second locations; and means for swapping said first and second results with respect to the first and second locations in said result register, in response to result swapping control information contained in the instruction.
26. A digital signal processor computation core comprising: first and second execution units for performing first and second operations in response to control signals; and control logic for providing said control signals to said first and second execution units in response to control information contained in an instruction for individually controlling said first and second operations.
27. A digital signal processor computation core as defined in claim 26 wherein said first and second execution units comprise first and second arithmetic logic units.
28. A digital signal processor computation core as defined in claim 26 wherein said first and second execution units comprise 16-bit arithmetic logic units that can be combined into a 32-bit arithmetic logic unit.
29. A digital signal processor computation core as defined in claim 26 wherein said first and second execution units comprise multiply accumulate units.
30. A digital signal processor computation core as defined in claim 27 wherein said first and second operations are selected from add operations and subtract operations, and may be the same or different.
31. A digital signal processor computation core as defined in claim 26 further comprising a register file for storing operands and results of the first and second operations, and first and second operand buses coupled between said register file and said first and second execution units, each of said first and second operand buses carrying a high operand and a low operand, wherein said first execution unit performs said first operation on said high operands and said second execution unit performs said second operation on said low operands.
32. A method for performing digital signal computations, comprising the steps of: performing first and second operations in first and second execution units; and individually controlling said first and second operations in response to control information contained in an instruction.
33. A digital signal processor computation core comprising: first and second execution units for performing first and second operations in response to control signals; and means responsive to control information contained in an instruction for providing said control signals to said first and second execution units for controlling said first and second operations, wherein said first and second operations may be the same or different.
34. A computation core for executing programmed instructions, comprising: an execution block for performing digital signal processor operations in response to digital signal processor instructions and for performing microcontroller operations in response to microcontroller instructions; a register file for storing operands for and results of the digital signal processor operations and the microcontroller operations; and control logic for providing control signals to said execution block and said register file in response to the digital signal processor instructions and the microcontroller instructions for executing the digital signal processor instructions and the microcontroller instructions.
35. A computation core as defined in claim 34 wherein said microcontroller instructions have a 16-bit format and wherein said digital signal processor instructions have a 32-bit format.
36. A computation core as defined in claim 35 wherein said digital signal processor instructions are configured for high efficiency digital signal computations.
37. A computation core as defined in claim 35 wherein said digital signal processor instructions contain information indicating whether one or more related instructions follow.
38. A computation core as defined in claim 37 wherein said one or more related instructions comprise load instructions.
39. A computation core as defined in claim 34 wherein said execution block comprises dual execution units.
40. A computation core as defined in claim 34 wherein said execution block comprises a first execution unit including a multiply accumulate unit, an arithmetic logic unit and a shifter, and a second execution unit including a multiply accumulate unit and an arithmetic logic unit.
41. A computation core as defined in claim 34 wherein said digital signal processor instructions are characterized by a relatively long format and said microcontroller instructions are characterized by a relatively short format.
42. A method for executing programmed instructions, comprising the steps of: executing digital signal processor operations in an execution block in response to digital signal processor instructions configured for efficient digital signal computation; and executing microcontroller operations in the execution block in response to microcontroller instructions configured for code storage density, wherein an application program having a mixture of digital signal processor instructions and microcontroller instructions is characterized by high code storage density and efficient digital signal computation.
43. A method as defined in claim 42 wherein said digital signal processor instructions are characterized by a relatively long format and said microcontroller instructions are characterized by a relatively short format.
44. A digital signal processor having a pipeline structure, comprising: a computation block for executing computation instructions, said computation block having one or more computation stages of the pipeline structure; and a control block for fetching and decoding the computation instructions and for accessing a memory, said control block having one or more control stages of the pipeline structure, wherein said computation stages and said control stages are positioned in the pipeline structure such that a result of the memory access is available to said computation stages without stalling said computation stages.
45. A digital signal processor as defined in claim 44 wherein said computation stages and said control stages are positioned in the pipeline structure so as to avoid stalling said computation stages when a computation instruction immediately follows a memory access instruction and requires the result of the memory access instruction.
46. A digital signal processor as defined in claim 44 wherein said computation stages and said control stages are positioned in the pipeline structure such that a memory access operation of a first instruction is completed before a first computation stage of a second instruction immediately following the first instruction.
47. A digital signal processor as defined in claim 44 wherein said computation stages and said control stages are positioned in the pipeline structure such that said control block has one or more idle stages following completion of the memory access.
48. A digital signal processor as defined in claim 47 wherein said computation stages and said control stages are positioned in the pipeline structure such that said computation block has one or more idle stages prior to a first computation stage.
49. A digital signal processor as defined in claim 44 wherein said pipeline structure has eight stages, including two memory access stages and two computation stages, and wherein said computation stages and said memory access stages are positioned in the pipeline structure such that a result of a second memory access stage of a first instruction is available for a first computation stage of a second instruction immediately following the first instruction.
50. A method for digital signal computation, comprising the steps of: executing computation operations in a computation block having one or more computation stages; executing control operations, including fetching instructions, decoding instructions and accessing a memory, in a control block having one or more control stages, wherein said computation stages and said control stages are configured in a pipeline structure; and positioning said computation stages relative to said control stages in the pipeline structure such that a result of a memory access is available to the computation stages without stalling said computation stages.
51. A method for determining an output of a finite impulse response digital filter having L filter coefficients in response to a set of M input samples, comprising the steps of:
(a) loading a first input sample into a first location in a first register;
(b) loading a second input sample into a second location in said first register;
(c) loading two coefficients into a second register;
(d) computing intermediate results using the contents of the first and second registers;
(e) loading a new input sample into the first location in said first register;
(f) computing intermediate results using the contents of the first and second registers;
(g) repeating steps (b) - (f) for L iterations to provide two output samples; and
(h) repeating steps (a) - (g) for M/2 iterations to provide M output samples.
52. A method as defined in claim 51 wherein said input samples and said coefficients are 16 bits each.
53. A method as defined in claim 51 wherein steps (d) and (f) each comprise multiply accumulate operations.
54. A method as defined in claim 51 wherein steps (d) and (f) each comprise selecting operands from the first and second registers in response to computation instructions and performing multiply accumulate operations on the selected operands.
55. A method as defined in claim 51 wherein step (d) comprises a multiply accumulate operation on a first coefficient in said second register and the input sample in the first location in said first register, and a multiply accumulate operation on the first coefficient in said second register and the input sample in the second location in said first register.
56. A method as defined in claim 55 wherein step (f) comprises a multiply accumulate operation on a second coefficient in said second register and the input sample in the first location in said first register, and a multiply accumulate operation on the second coefficient in said second register and the input sample in the second location in said first register.
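For illustration only, the register-level procedure of claims 51 to 56 can be sketched in Python. This is a minimal sketch under stated assumptions, not the patented implementation. The function name `fir_dual_mac` and the variables `lo`, `hi`, `a0` and `a1` are hypothetical: `lo` and `hi` stand for the first and second locations of the first register, and `a0` and `a1` for two accumulators. The sketch assumes correlation-style indexing, y[n] = sum over k of c[k] * x[n + k], an even number of coefficients L, an even number of outputs M, and at least M + L - 1 input samples. Each pass of the inner loop consumes two coefficients, so the sketch makes L/2 passes per output pair; how that maps onto the iteration count recited in step (g) is an interpretation. Python integers are unbounded, so the 16-bit operand width of claim 52 and any accumulator saturation are not modeled.

```python
def fir_dual_mac(x, c, m):
    """Compute m FIR outputs, two at a time, following steps (a)-(h).

    x: input samples, c: filter coefficients, m: number of outputs.
    All names here are hypothetical and chosen for illustration.
    """
    num_coeffs = len(c)
    if num_coeffs % 2 or m % 2:
        raise ValueError("sketch assumes even coefficient and output counts")
    if len(x) < m + num_coeffs - 1:
        raise ValueError("need at least m + len(c) - 1 input samples")

    y = []
    for n in range(0, m, 2):              # (h) one pass per output pair
        a0 = a1 = 0                       # two accumulators
        lo = x[n]                         # (a) first location of first register
        for j in range(0, num_coeffs, 2): # (g) repeat (b)-(f)
            hi = x[n + j + 1]             # (b) second location of first register
            c0, c1 = c[j], c[j + 1]       # (c) two coefficients, second register
            a0 += c0 * lo                 # (d) first coefficient, both locations
            a1 += c0 * hi
            lo = x[n + j + 2]             # (e) new sample into first location
            a0 += c1 * hi                 # (f) second coefficient, both locations
            a1 += c1 * lo
        y.extend((a0, a1))
    return y


def fir_direct(x, c, m):
    """Straightforward reference using the same indexing assumption."""
    return [sum(c[k] * x[n + k] for k in range(len(c))) for n in range(m)]
```

In this arrangement each loaded sample is reused by both accumulators, so every coefficient pair needs only two new sample loads for four multiply accumulate operations. Comparing `fir_dual_mac(x, c, m)` against `fir_direct(x, c, m)` on the same inputs is a simple way to check the indexing.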
EP00930720A 1999-05-12 2000-05-12 Digital signal processor computation core Withdrawn EP1188112A2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP10184733A EP2267896A3 (en) 1999-05-12 2000-05-12 Method for implementing finite impulse response filters
EP10183715.1A EP2267596B1 (en) 1999-05-12 2000-05-12 Processor core for processing instructions of different formats
EP10184831A EP2267597A3 (en) 1999-05-12 2000-05-12 Digital signal processor having a pipeline structure

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US13376699P 1999-05-12 1999-05-12
US133766P 1999-05-12
PCT/US2000/013232 WO2000068783A2 (en) 1999-05-12 2000-05-12 Digital signal processor computation core

Related Child Applications (1)

Application Number Title Priority Date Filing Date
EP10183715.1A Division EP2267596B1 (en) 1999-05-12 2000-05-12 Processor core for processing instructions of different formats

Publications (1)

Publication Number Publication Date
EP1188112A2 (en) 2002-03-20

Family

ID=22460216

Family Applications (4)

Application Number Title Priority Date Filing Date
EP10183715.1A Expired - Lifetime EP2267596B1 (en) 1999-05-12 2000-05-12 Processor core for processing instructions of different formats
EP10184831A Withdrawn EP2267597A3 (en) 1999-05-12 2000-05-12 Digital signal processor having a pipeline structure
EP00930720A Withdrawn EP1188112A2 (en) 1999-05-12 2000-05-12 Digital signal processor computation core
EP10184733A Withdrawn EP2267896A3 (en) 1999-05-12 2000-05-12 Method for implementing finite impulse response filters

Family Applications Before (2)

Application Number Title Priority Date Filing Date
EP10183715.1A Expired - Lifetime EP2267596B1 (en) 1999-05-12 2000-05-12 Processor core for processing instructions of different formats
EP10184831A Withdrawn EP2267597A3 (en) 1999-05-12 2000-05-12 Digital signal processor having a pipeline structure

Family Applications After (1)

Application Number Title Priority Date Filing Date
EP10184733A Withdrawn EP2267896A3 (en) 1999-05-12 2000-05-12 Method for implementing finite impulse response filters

Country Status (3)

Country Link
EP (4) EP2267596B1 (en)
JP (1) JP2002544587A (en)
WO (1) WO2000068783A2 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002010914A1 (en) * 2000-07-28 2002-02-07 Delvalley Limited A method of processing data
US7174543B2 (en) 2001-08-29 2007-02-06 Analog Devices, Inc. High-speed program tracing
JP4502662B2 (en) * 2004-02-20 2010-07-14 アルテラ コーポレイション Multiplier-accumulator block mode split
EP1849095B1 (en) * 2005-02-07 2013-01-02 Richter, Thomas Low latency massive parallel data processing device
CN100559905C (en) 2005-07-20 2009-11-11 大唐移动通信设备有限公司 baseband chip
US7555514B2 (en) 2006-02-13 2009-06-30 Atmel Corporation Packed add-subtract operation in a microprocessor
JP5481793B2 (en) * 2008-03-21 2014-04-23 富士通株式会社 Arithmetic processing device and method of controlling the same
CA2751388A1 (en) * 2011-09-01 2013-03-01 Secodix Corporation Method and system for multi-mode instruction-level streaming
FR3021428B1 (en) * 2014-05-23 2017-10-13 Kalray MULTIPLICATION OF BIT MATRICES USING EXPLICIT REGISTERS
CN108334337B (en) * 2018-01-30 2022-02-01 江苏华存电子科技有限公司 Low-delay instruction dispatcher with automatic management function and filtering guess access method
CN113157636B (en) * 2021-04-01 2023-07-18 西安邮电大学 Coprocessor, near data processing device and method

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS61255433A (en) * 1985-05-07 1986-11-13 Mitsubishi Electric Corp Arithmetic unit
JPH077356B2 (en) * 1989-05-19 1995-01-30 株式会社東芝 Pipelined microprocessor
US5175863A (en) * 1989-10-23 1992-12-29 International Business Machines Corporation Signal data processing system having independently, simultaneously operable alu and macu
US5926644A (en) * 1991-10-24 1999-07-20 Intel Corporation Instruction formats/instruction encoding
DE69327504T2 (en) * 1992-10-19 2000-08-10 Koninklijke Philips Electronics N.V., Eindhoven Data processor with operational units that share groups of register memories
JPH0876977A (en) * 1994-09-06 1996-03-22 Matsushita Electric Ind Co Ltd Arithmetic unit for fixed decimal point
EP1302848B1 (en) * 1994-12-01 2006-11-02 Intel Corporation A microprocessor having a multiply operation
US5867726A (en) * 1995-05-02 1999-02-02 Hitachi, Ltd. Microcomputer
TW424192B (en) * 1995-05-02 2001-03-01 Hitachi Ltd Microcomputer
CN1264085C (en) * 1995-08-31 2006-07-12 英特尔公司 Processor capable of executing packet shifting operation
HUP9900030A3 (en) * 1995-08-31 1999-11-29 Intel Corp An apparatus for performing multiply-add operations on packed data
AU6905496A (en) * 1995-09-01 1997-03-27 Philips Electronics North America Corporation Method and apparatus for custom operations of a processor
US5710914A (en) * 1995-12-29 1998-01-20 Atmel Corporation Digital signal processing method and system implementing pipelined read and write operations
US5822606A (en) * 1996-01-11 1998-10-13 Morton; Steven G. DSP having a plurality of like processors controlled in parallel by an instruction word, and a control processor also controlled by the instruction word
US5954811A (en) 1996-01-25 1999-09-21 Analog Devices, Inc. Digital signal processor architecture
JP3658072B2 (en) * 1996-02-07 2005-06-08 株式会社ルネサステクノロジ Data processing apparatus and data processing method
GB2317466B (en) * 1996-09-23 2000-11-08 Advanced Risc Mach Ltd Data processing condition code flags
US6530014B2 (en) * 1997-09-08 2003-03-04 Agere Systems Inc. Near-orthogonal dual-MAC instruction set architecture with minimal encoding bits
US6260137B1 (en) 1997-09-12 2001-07-10 Siemens Aktiengesellschaft Data processing unit with digital signal processing capabilities

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO0068783A2 *

Also Published As

Publication number Publication date
EP2267596B1 (en) 2018-08-15
WO2000068783A3 (en) 2001-08-09
WO2000068783A2 (en) 2000-11-16
EP2267596A3 (en) 2012-01-04
EP2267896A2 (en) 2010-12-29
EP2267597A3 (en) 2012-01-04
EP2267896A3 (en) 2013-02-20
EP2267596A2 (en) 2010-12-29
JP2002544587A (en) 2002-12-24
EP2267597A2 (en) 2010-12-29

Similar Documents

Publication Publication Date Title
EP1047989B1 (en) Digital signal processor having data alignment buffer for performing unaligned data accesses
US7937559B1 (en) System and method for generating a configurable processor supporting a user-defined plurality of instruction sizes
JP3983857B2 (en) Single instruction multiple data processing using multiple banks of vector registers
EP0927393B1 (en) Digital signal processing integrated circuit architecture
US6002881A (en) Coprocessor data access control
US8412917B2 (en) Data exchange and communication between execution units in a parallel processor
EP1692611B1 (en) Method and apparatus for performing packed data operations with element size control
US5881257A (en) Data processing system register control
US5784602A (en) Method and apparatus for digital signal processing for integrated circuit architecture
US6374346B1 (en) Processor with conditional execution of every instruction
EP1124181B1 (en) Data processing apparatus
US5969975A (en) Data processing apparatus registers
US6615341B2 (en) Multiple-data bus architecture for a digital signal processor using variable-length instruction set with single instruction simultaneous control
KR20170110689A (en) A vector processor configured to operate on variable length vectors using digital signal processing instructions,
US7308559B2 (en) Digital signal processor with cascaded SIMD organization
US5881259A (en) Input operand size and hi/low word selection control in data processing systems
WO2003098379A2 (en) Method and apparatus for adding advanced instructions in an extensible processor architecture
CN104133748B (en) To combine the method and system of the correspondence half word unit from multiple register cells in microprocessor
US6496920B1 (en) Digital signal processor having multiple access registers
US7111155B1 (en) Digital signal processor computation core with input operand selection from operand bus for dual operations
EP2267596B1 (en) Processor core for processing instructions of different formats
WO1998020422A1 (en) Eight-bit microcontroller having a risc architecture
US6915411B2 (en) SIMD processor with concurrent operation of vector pointer datapath and vector computation datapath
US7107302B1 (en) Finite impulse response filter algorithm for implementation on digital signal processor having dual execution units
US6859872B1 (en) Digital signal processor computation core with pipeline having memory access stages and multiply accumulate stages positioned for efficient operation

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20011212

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

RIN1 Information on inventor provided before grant (corrected)

Inventor name: EDMONDSON, JOHN

Inventor name: RIVIN, RUSSEL, L.

Inventor name: ANDERSON, WILLIAM, CAROLL

Inventor name: FRIDMAN, JOSE

Inventor name: HOFFMAN, MARC

RBV Designated contracting states (corrected)

Designated state(s): DE FR GB

17Q First examination report despatched

Effective date: 20080208

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20150306