US20110273459A1 - Device for the parallel processing of a data stream - Google Patents
Device for the parallel processing of a data stream
- Publication number
- US20110273459A1 US13/121,417
- Authority
- US
- United States
- Prior art keywords
- data
- computation
- processing
- registers
- block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8007—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
- G06F15/8015—One dimensional arrays, e.g. rings, linear arrays, buses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
Definitions
- the invention relates to a device for processing a data stream. It lies in the field of computation architectures and finds particular utility in embedded applications of multimedia type integrating a video sensor. It involves notably mobile telephony, mobile multimedia readers, photographic apparatus and digital camcorders. The invention also finds utility in applications relating to telecommunications and, more generally, in any signal processing chain for processing digital data at high rate.
- the “data flow” mode is a data processing mode according to which the data entering the computation module are processed as and when they arrive, at the rate of their arrival, a result being provided as output from the computation module at the same rate, optionally after a latency time.
- Dedicated computation modules make it possible to comply with the fabrication cost constraints on account of their small silicon area and the performance constraints, notably as regards computational power and electrical consumption.
- such modules suffer from a flexibility problem, it not being possible for the processing operations supported to be modified after the construction of the modules.
- these modules are parametrizable. Stated otherwise, a certain number of processing-related parameters may be modified after construction.
- a solution to this lack of flexibility consists in using completely programmable processors.
- the processors most commonly used are signal processors, well known in the literature under the acronym “DSP” for “Digital Signal Processor”.
- Drawbacks of these processors are their significant silicon footprint and their electrical consumption, often rendering them ill-adapted to highly constrained embedded applications.
- a circuit comprises a data processing unit having very long instruction words, called a VLIW (“Very Long Instruction Word”) unit, and a unit making it possible to execute an instruction on several computation units, called an SIMD (“Single Instruction Multiple Data”) unit.
- computation units of VLIW and/or SIMD type are implanted in the circuit as a function of the necessary computational power. The choice of the type of unit to be included in the circuit, of their number and of the way they are chained together is decided before the construction of the circuit by analyzing the application code and necessary resources.
- the order in which the units are chained together is fixed and it does not make it possible to subsequently change the chaining of the processing operations.
- the units are globally fairly complex since the control code for the application is not separate from the processing code.
- the processing operators of these units are of significant size, thereby leading to an architecture whose silicon area and electrical consumption are more significant for equal computational power.
- a C-language code may be transformed into a set of elementary instructions by a specific compiler.
- the set of instructions is then implanted on a configurable matrix of predefined operators.
- This technology may be compared with that of FPGA, which is the acronym for “Field Programmable Gate Array”, the computation grain being bigger. It does not therefore make it possible to obtain programmable circuits, but only circuits that can be configured by code compilation. If it is desired to integrate parts of program code that are not provided for at the outset, computation resources which are not present in the circuit are then necessary. It therefore becomes difficult or indeed impossible to implement this code.
- the data are processed by a so-called parallel architecture.
- Such an architecture comprises several computation tiles linked together by an interconnection bus.
- Each computation tile comprises a storage unit making it possible to store the data locally, a control unit providing instructions for carrying out processing on the stored data, processing units carrying out the instructions received from the control unit on the stored data and an input/output unit conveying the data either between the interconnection bus and the storage unit, or between the processing units and the interconnection bus.
- This architecture presents several advantages.
- a first advantage is the possibility of modifying the code to be executed by the processing units, even after the construction of the architecture.
- the code to be executed by the processing units generally comprises only computation instructions but no control or address computation instruction.
- a second advantage is the possibility of carrying out in parallel, either an identical processing on several data, or more complex processing operations for one and the same number of clock cycles by profiting from the parallel placement of the processing units.
- a third advantage is that the computation tiles may be chained together according to the processing operations to be carried out on the data, the interconnection bus conveying the data between the computation tiles in a configurable order.
- the parallel architecture may be extended by adding further computation tiles, so as to adapt its processing capabilities to the processing operations to be carried out.
- the management of the data in the computation tiles is complex and generally requires significant memory resources. In particular, when a computation tile is performing a processing on a data neighborhood, all the data of this neighborhood must be available to it simultaneously, whereas the data arrive in the form of a continuous stream.
- the storage unit of the computation tile must then store a significant part of the data of the stream before being able to perform a processing on a neighborhood.
- This storage and the management of the stored data require optimization so as to limit the silicon area and the electrical consumption of the parallel architecture while offering computational performance adapted to the processing of a data flow.
- An aim of the invention is to propose a computation structure which is programmable and adapted to the processing of a data stream, notably when processing operations must be carried out on neighborhoods of data.
- the subject of the invention is a device for processing a data stream originating from a device generating matrices of NI rows by Nc columns of data.
- the processing device comprises K computation tiles and interconnection means for transferring the data stream between the computation tiles.
- At least one computation tile comprises:
- An advantage of the invention is that the storage unit of a computation tile in which a processing on a neighborhood of data is carried out is particularly adapted to such processing, notably in terms of dimensioning of the memory registers and management of accesses to the memory registers by the processing units.
- FIG. 1, an exemplary device for processing a data stream according to the invention;
- FIG. 2, an exemplary processing unit comprising a very long instruction word processor;
- FIG. 3, an exemplary management of a block of shaping memories;
- FIG. 4, an exemplary management of a block of neighborhood registers in a case where the data of the block of shaping memories are in order;
- FIG. 5, an exemplary management of the block of neighborhood registers in the case where the data of the block of shaping memories are not in order;
- FIG. 6, a set of timecharts illustrating the temporal management of a block of neighborhood registers;
- FIG. 7, an exemplary embodiment of a computation tile comprising several processing units in parallel;
- FIG. 8, a set of timecharts illustrating the temporal management of a storage unit of a computation tile comprising two processing units in parallel;
- FIG. 9, an exemplary embodiment of an input/output unit;
- FIG. 10, an exemplary implementation of the device according to the invention for video images;
- FIG. 11, a schematic representation of a Bayer filter;
- FIG. 12, an example of format registers allowing the splitting of the data of the stream;
- FIG. 13, an exemplary mechanism allowing access to a register containing metadata;
- FIG. 14, an exemplary embodiment of a computation tile comprising several processing units, the processing units receiving specific instructions as a function of metadata;
- FIG. 15, an exemplary embodiment of an insertion operator.
- the subsequent description is given in relation to a processing chain for a video data stream originating from a video sensor such as a CMOS sensor.
- the processing chain makes it possible for example to reconstruct color images on the basis of a monochrome video sensor to which is applied a color filter, for example a Bayer filter, to improve the quality of the images restored, or else to carry out morphological operations such as erosion/dilation or the low-level part in the processing of pixels of advanced applications such as image stabilization, red eye correction or the detection of faces.
- the device according to the invention may be equally suitable for processing a stream of data other than those arising from a video sensor.
- the device can for example process an audio data stream or data in Fourier space.
- the device exhibits particular interest for the processing of data which, although being conveyed in the form of a stream, possess a coherence in a two-dimensional space.
- FIG. 1 schematically represents a device 1 for processing a data stream according to the invention.
- a video sensor 2 generates a digital data stream directed toward the processing device 1, by way of a data bus 3.
- the data arising from the video sensor 2 are referred to as raw data.
- the device 1 processes these raw data so as to generate as output data referred to as final data.
- the device 1 according to the invention comprises processing units UT, control units UC, storage units UM and input/output units UES grouped into K computation tiles TC.
- the device 1 also comprises interconnection means 4 such as data buses 41, 42.
- Each computation tile TC comprises a storage unit UM, one or more control units UC, at least one processing unit UT per control unit UC and an input/output unit UES.
- the storage units UM make it possible to shape the data of the stream so that they can be processed by the processing units UT as a function of code instructions delivered by the control units UC.
- the input/output units UES make it possible to convey the data stream between the interconnection means 4 and the storage units UM on the one hand, and between the processing units UT and the interconnection means 4 on the other hand.
- In the example of FIG. 1, the device 1 comprises four computation tiles TC. The first and the fourth computation tiles TC 1 and TC 4 each comprise a storage unit UM, a control unit UC, a processing unit UT and an input/output unit UES. The second computation tile TC 2 comprises a storage unit UM, a control unit UC, two processing units UT and an input/output unit UES. The third computation tile TC 3 comprises a storage unit UM, two control units UC, two processing units UT per control unit UC and an input/output unit UES.
- Each computation tile TC makes it possible to carry out a function or a series of functions on the basis of code instructions.
- each computation tile TC carries out for example one of the following functions: correction of the white balance, dematrixing, noise reduction, contour accentuation.
- the composition of a computation tile TC depends notably on the function or functions that it has to carry out.
- the number of control units UC making up a computation tile TC depends on the number of different processing operations having to be carried out simultaneously by the computation tile TC. Since each control unit UC within the computation tile TC is able to comprise its own code, a computation tile TC comprises for example as many control units UC as distinct processing operations to be carried out in parallel on the data.
- the processing units UT may be more or less complex. In particular, they can comprise either simple dedicated operators, for example composed of logic blocks, or processors. Each processing unit UT is independent of the others and can comprise different operators or processors.
- the dedicated operators are for example multipliers, adders/subtracters, assignment operators or shift operators.
- the processing units UT contain only the dedicated operators commonly used for the processing envisaged.
- a processing unit UT can also comprise a processor.
- the processor comprises a single arithmetic and logic unit.
- the processor is a very long instruction word (VLIW) processor.
- Such a processor can comprise several arithmetic and logic units.
- a VLIW processor comprises for example instruction decoders, no longer arithmetic and logic units but only computation operators, a local memory and data registers.
- only the computation operators necessary for the execution of the computation codes to be carried out are implanted in the processor during its design. Thereafter, two or more of them may be used in the same cycle to perform distinct operations in parallel. The unused operators do not receive the clock signals.
- the VLIW processor comprises two pathways. Stated otherwise, it can execute up to two instructions in one and the same clock cycle.
- the processor comprises a first instruction decoder 21, a second instruction decoder 22, a first set of multiplexers 23, a set of computation operators 24, a second set of multiplexers 25, a set of data registers 26 and a local memory 27.
- the instruction decoders 21 and 22 receive instructions originating from a control unit UC.
- the multiplexers 23 direct data to be processed onto an input of one of the computation operators 24 and the multiplexers 25 direct the processed data to the data registers 26 .
- the data registers 26 containing the processed data may be linked up with outputs of the processor.
- the size of the very long instruction words is for example 48 bits, i.e. 24 bits per pathway.
- the computation operators 24 thus work in 24-bit precision.
- the computation operators 24 are advantageously two adders/subtracters, a multiplier, an assignment operator, an operator for writing to the local memory and a shift operator.
- the execution of the instructions may be conditioned by setting a flag.
- the instruction can then be supplemented with a prefix indicating the execution condition.
- the flag is for example a bit of a register containing the result of an instruction executed during the previous clock cycle. This bit can correspond to the zero, sign or carry indicators of the register.
- the instruction decoders 21 and 22 test the setting of the flag related to this instruction. If this setting complies with the execution condition, the operation is executed, otherwise it is replaced with a non-operation instruction, called NOP.
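- The flag test described above can be sketched as follows. This is a minimal illustration and not the decoder of the invention: the condition names, the `Flags` container and the tuple form of an operation are hypothetical, since the text only states that the flag may be the zero, sign or carry indicator left by the previous instruction.

```python
from dataclasses import dataclass

NOP = ("NOP",)  # stands for the non-operation instruction


@dataclass
class Flags:
    """Indicators left by the instruction executed in the previous cycle."""
    zero: bool = False
    sign: bool = False
    carry: bool = False


# Hypothetical condition prefixes; the source does not enumerate them.
CONDITIONS = {
    "always": lambda f: True,
    "if_zero": lambda f: f.zero,
    "if_not_zero": lambda f: not f.zero,
    "if_sign": lambda f: f.sign,
    "if_carry": lambda f: f.carry,
}


def issue(condition, operation, flags):
    """Return the operation if its execution condition holds, else NOP."""
    return operation if CONDITIONS[condition](flags) else NOP
```

- For instance, `issue("if_zero", ("ADD", 1, 2), Flags(zero=True))` passes the addition through, whereas the same call with `Flags()` yields `NOP`.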
- each instruction word is coded on 24 bits.
- the first 3 bits (bits 0 to 2) can contain the instruction condition, the following two bits (bits 3 and 4) can code the mode of access to the datum, the sixth, seventh and eighth bits (bits 5 to 7) can code the identifier of the operation, the following four bits (bits 8 to 11) can designate the destination register, the following four bits (bits 12 to 15) can designate the source register and the last 8 bits (bits 16 to 23) can contain a constant.
- An exemplary programming using such a coding is given in the annex.
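- The 24-bit coding can be illustrated by a small packing and unpacking sketch. The field names and the choice of bit 0 as the least significant bit are assumptions made for the illustration; the text only gives the bit ranges.

```python
# (name, first bit, width) for each field of the 24-bit instruction word.
# Field names are illustrative; bit 0 is assumed to be the least significant bit.
FIELDS = (
    ("cond", 0, 3),    # execution condition, bits 0 to 2
    ("mode", 3, 2),    # mode of access to the datum, bits 3 and 4
    ("op", 5, 3),      # identifier of the operation, bits 5 to 7
    ("dest", 8, 4),    # destination register, bits 8 to 11
    ("src", 12, 4),    # source register, bits 12 to 15
    ("const", 16, 8),  # constant, bits 16 to 23
)


def encode_word(**values):
    """Pack the given fields into one 24-bit word; missing fields are 0."""
    word = 0
    for name, shift, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
    return word


def decode_word(word):
    """Split a 24-bit word back into its fields."""
    return {
        name: (word >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }
```

- The field widths add up to 3 + 2 + 3 + 4 + 4 + 8 = 24 bits, so a very long instruction word of 48 bits holds two such words, one per pathway.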
- the device 1 for processing a data stream comprises M control units UC, M lying between 1 and N, N being the number of processing units UT.
- each processing unit UT can have its own control unit UC.
- at least one computation tile TC comprises several processing units UT, as in the example of FIG. 1 (TC 2, TC 3).
- a control unit UC of this computation tile TC then provides instructions to several processing units UT, these processing units UT being said to be in parallel.
- a control unit UC can comprise a memory making it possible to store the code instructions for the processing unit or units UT that it serves.
- a control unit UC can also comprise an ordinal counter, an instruction decoder and an address manager.
- the address manager and the ordinal counter make it possible to apply a different processing as a function of the color of the current pixel.
- the code may be split up into code segments, each code segment comprising instructions for one of the colors of the filter.
- the address manager can indicate to the ordinal counter the color of the current pixel, for example red, green or blue.
- the address manager comprises a two-bit word making it possible to code up to four different colors or natures of pixels in a pixel neighborhood of size two by two.
- the ordinal counter is incremented by a shift value (offset) depending on the value of the word.
- the ordinal counter then makes it possible to point at the code segment corresponding to the color of the current pixel.
- the four shift values are determined during compilation of the code as a function of the number of instructions of each of the code segments.
- the use of an address manager and of an ordinal counter makes it possible to unburden the programmer and thus avoids his having to determine the nature of the current pixel per program. This management becomes automatic and allows a shorter execution time and simpler programming.
- the same instructions are applied to all the pixels.
- the shift values are then equal and determined so that the ordinal counter points at the first instruction after the initialization code.
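- The selection of a code segment by the ordinal counter can be sketched as below. Several points are assumptions and not taken from the text: the segments are laid out one after another behind the initialization code, the ordinal counter is counted from the start of the program, and the two-bit word is built from the parities of the row and column indices of the pixel.

```python
def segment_offsets(init_length, segment_lengths):
    """Shift value of each code segment, counted from the start of the program.

    The segments are assumed to follow the initialization code in order,
    so each offset is the sum of the lengths of everything before it.
    """
    offsets = []
    position = init_length
    for length in segment_lengths:
        offsets.append(position)
        position += length
    return offsets


def pixel_word(row, col):
    """Two-bit word locating a pixel inside a 2 x 2 pattern (assumed coding)."""
    return ((row & 1) << 1) | (col & 1)


def entry_point(row, col, offsets):
    """Instruction index at which the processing of the pixel starts."""
    return offsets[pixel_word(row, col)]
```

- With two initialization instructions and segments of 5, 3, 3 and 4 instructions, the four shift values are 2, 7, 10 and 13. When the same instructions apply to all pixels, the four values are simply equal to the length of the initialization code.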
- the device 1 for processing a data stream also comprises K storage units UM, K lying between 1 and M.
- a computation tile TC can comprise several control units UC, as in the example of FIG. 1 (TC 3 ).
- the same data of the stream, or neighboring data, that are present in the storage unit UM can then be processed differently by the processing units UT of the computation tile, each control unit UC providing instructions to at least one processing unit UT.
- the main function of the storage units UM is to shape the data of the stream so as to facilitate access to these data by the processing units UT.
- a storage unit UM comprises an equal number of data registers to the number of processing units UT situated in the computation tile TC of the storage unit UM considered.
- a storage unit UM shapes the data in the form of neighborhoods and manages access to the data when processing units UT are in parallel.
- a storage unit UM can comprise a first memory block called a block of shaping memories and a second memory block called a block of neighborhood registers. Since the storage units UM of the various computation tiles TC are independent of one another, the device 1 for processing the data stream can comprise at one and the same time storage units UM according to the first embodiment and storage units UM according to the second embodiment.
- the second embodiment makes it possible to carry out processing operations on data neighborhoods.
- a neighborhood may be defined as a mesh of adjacent pixels, this mesh generally being square or at least rectangular.
- a rectangular mesh may be defined by its dimension VI × Vc where VI is the number of pixels of the neighborhood row-wise and Vc is the number of pixels of the neighborhood column-wise.
- the block of shaping memories stores the data of the stream so that they can be copied in a systematic manner with each arrival of a new datum.
- the block of neighborhood registers allows access to the pixels of the current neighborhood by the processing unit or units UT of the computation tile considered.
- FIG. 3 illustrates, by a block 31 of shaping memories represented at different time steps T, an exemplary management of the block 31 for data corresponding to a stream of values of pixels originating from a device generating matrices of NI rows by Nc columns of data, such as a video sensor 32 .
- the video sensor 32 has a resolution of Nc columns by NI rows of pixels. The resolution is for example VGA (640 × 480), “HD Ready” (1280 × 720) or “Full HD” (1920 × 1080).
- the pixels are dispatched and stored as and when they arrive to the block 31 of shaping memories.
- This block 31 is advantageously of dimension VI × Nc so as to make it possible to generate neighborhoods of dimension VI × Vc.
- the block 31 comprises VI × Nc memory cells arranged according to a mesh of VI rows and Nc columns. Usual values for VI are three, four, five, six or seven. Physically, the block 31 can consist of one or more memory modules.
- the block 31 may be managed as a shift register. Stated otherwise, at each time step or clock cycle, the data are shifted so as to leave room for the new incoming datum.
- the block 31 is managed as a conventional memory in such a way that the pixels are copied in their order of arrival.
- a counter CPT that is incremented on each incoming datum is considered.
- Each new pixel coming from the data stream is then copied into a cell 33 of the block 31 of shaping memories situated in the row corresponding to E(CPT/Nc), where E(x) is the function returning the integer part of a number x, and in the column corresponding to the remainder of CPT/Nc.
- the counter CPT is reset to zero each time it reaches the value equal to VI × Nc.
- a counter CPTC that is incremented after each incoming datum and a counter CPTL that is incremented each time the counter CPTC reaches the value Nc are considered.
- the counter CPTC is reset to zero each time it reaches the value Nc and the counter CPTL is reset to zero each time it reaches the value VI.
- Each new pixel coming from the data stream is then copied into the cell 33 whose row index corresponds to the value CPTL and whose column index corresponds to the value CPTC.
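- The two addressing schemes, one counter CPT or the pair CPTC and CPTL, can be compared with the sketch below. It assumes that the counters hold the address of the incoming pixel and are incremented once the pixel has been written; the function names are illustrative.

```python
def cells_single_counter(vi, nc, count):
    """Cells written for `count` pixels using the single counter CPT."""
    cells, cpt = [], 0
    for _ in range(count):
        # row = E(CPT / Nc), column = remainder of CPT / Nc
        cells.append((cpt // nc, cpt % nc))
        cpt += 1
        if cpt == vi * nc:  # reset once VI x Nc is reached
            cpt = 0
    return cells


def cells_two_counters(vi, nc, count):
    """Cells written for `count` pixels using the counters CPTC and CPTL."""
    cells, cptc, cptl = [], 0, 0
    for _ in range(count):
        cells.append((cptl, cptc))
        cptc += 1
        if cptc == nc:  # end of a row: wrap the column counter
            cptc = 0
            cptl += 1
            if cptl == vi:  # end of the block: wrap the row counter
                cptl = 0
    return cells
```

- Both functions return the same sequence of cells. The second one avoids the division and the remainder, at the cost of one additional counter.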
- FIG. 4 illustrates an exemplary management of the block of neighborhood registers for data originating from the block 31 of shaping memories.
- the block 34 of neighborhood registers comprises for example a number of neighborhood registers equal to VI × Vc. These neighborhood registers are arranged in the same manner as the neighborhood of pixels, that is to say they form a mesh of VI rows and Vc columns of registers.
- the copying of the data of the block 31 of shaping memories to the neighborhood registers starts as soon as there is a number of data in the block 31 equal to (VI − 1) × Nc + 1. In the case of a neighborhood of dimension 3 × 3, represented in FIG. 4, the copying of the data thus starts when two rows of data plus one datum are present in the block 31.
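- As a worked example of this threshold, the short sketch below evaluates (VI − 1) × Nc + 1; the function name is illustrative.

```python
def first_copy_count(vi, nc):
    """Number of data that must be present in block 31 before copying starts."""
    return (vi - 1) * nc + 1
```

- For a 3 × 3 neighborhood on a VGA sensor (Nc = 640), copying starts once 1281 data are present, that is, two full rows plus one datum.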
- the data are copied at each clock cycle in groups of VI data of one and the same column.
- the index of the column to be copied is given by the value of CPTC.
- This column in fact comprises the last pixel that arrived in the block 31 .
- a column 35 of VI data registers is added to the neighborhood registers. This column 35 makes it possible to disable accesses to the registers of the block 34 by the processing units UT only during a single clock cycle, that of the shifting of the values in the block 34 . Otherwise, accesses are disabled at one and the same time during the shifting of the values and during the copying of the data from the block 31 .
- the data of the column, indicated by the counter CPTC, of the block 31 are copied into the registers of the column 35 .
- all the data of the block 34 and of the column 35 are shifted by one column.
- the data of a first column 341 are shifted toward a second column 342, the data of this column 342 are shifted toward a third column 343 and the data of the column 35 are shifted toward the column 341.
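- The shift of the block 34 fed by the column 35 can be modeled as follows. The list layout is an assumption: each row of `block34` lists the registers from column 341 (index 0) to the last column, and `column35` holds one value per row.

```python
def shift_neighborhood(block34, column35):
    """Shift block 34 by one column, feeding column 341 from column 35.

    Column 341 moves to 342, 342 moves to 343, and so on; the content of
    the last column is dropped. A new block is returned, the inputs are
    left unchanged.
    """
    if len(block34) != len(column35):
        raise ValueError("column 35 must hold one value per neighborhood row")
    return [[new] + row[:-1] for row, new in zip(block34, column35)]
```

- Because the column 35 is filled beforehand, the neighborhood registers are only unavailable to the processing units UT during this single shifting step.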
- the data are not always stored in the block 31 according to the order of the rows of the video sensor 32 .
- the pixels must be copied into the column 35 or, if appropriate, into the column 341 of the block 34 , in a different order.
- FIG. 5 illustrates such a case where the last data of the stream are stored in the first row of the block 31 .
- the copying of the pixels into column 35 may be managed by the following placement steps:
- the pixel of the block 31 of shaping memories which is situated in the row RowNo and in the column indicated by CPTC is notably copied to column 35 , or, if appropriate, into the first column 341 of the block 34 , in the row defined by (CPTL+RowNo+1) modulo VI.
- RowNo takes all the positive integer values lying between 1 and VI so as to allow the copying of the pixels for all the rows of the neighborhood.
- the copying of the pixels of the block 31 into the column 35 of registers is not performed simultaneously with the shifting of the pixels in the block 34 .
- This embodiment allows the processing units UT to access the data present in the block 34 of neighborhood registers during a longer period.
- FIG. 6 represents a set of timecharts making it possible to implement this embodiment.
- the temporal shift between the copying of the pixels and the shifting of the pixels into the block 34 may be effected by introducing, in addition to a first clock, called the pixel clock 61 and making it possible to regulate the data stream and the copying of the pixels, a second clock, called the shifted pixel clock 62 .
- This shifted pixel clock 62 may be at the same frequency as the pixel clock 61 but shifted in time.
- This shift corresponds for example to a period of the clock of the processing units UT 63 .
- the data present in the block 34 are then accessible throughout the period separating two clock ticks of the shifted pixel clock 62 .
- Access to the neighborhood registers by the processing units UT may be effected through an input/output port, for example integrated into each processing unit UT, the number of whose connections is equal to the number of neighborhood registers, multiplied by the size of the data.
- Each neighborhood register is linked to the input/output port.
- each storage unit UM comprises a multiplexer the number of whose inputs is equal to the number of neighborhood registers of the block 34 and the number of whose outputs is equal to the number of data that can be processed simultaneously by the processing unit UT of the computation tile TC considered.
- the processing unit UT can then comprise an input/output port the number of whose connections is equal to the number of data that can be processed simultaneously, multiplied by the size of the data.
- a processing unit UT comprising a VLIW processor with two pathways processing data on 12 bits can comprise an input/output port with 24 (2 × 12) connections.
- one and the same storage unit UM provides data to several processing units UT in parallel.
- the processing device 1 comprises a computation tile TC comprising several processing units UT.
- This embodiment advantageously uses the storage units UM comprising a block 31 of shaping memories and a block 34 of neighborhood registers.
- the dimension of the block 34 of neighborhood registers has to be adapted.
- FIG. 7 illustrates an exemplary computation tile TC where a storage unit UM provides data to n processing units UT in parallel, n being less than or equal to the number N of processing units UT of the device 1 .
- the instructions are provided to the n processing units UT by one or more control units UC.
- the copying of the data of the block 31 to the column 35 of registers then starts when the block 31 of shaping memories comprises (VI − 1) × Nc + 1 data.
- the processing of the data is carried out when n new data have arrived in the block 31 .
- Access to the neighborhood registers by the n processing units UT can also be carried out through an input/output port integrated into each processing unit UT.
- the number of connections of the input/output port of each processing unit UT is then equal to the number of neighborhood registers to which the processing unit UT requires access, multiplied by the size of the data.
- the storage unit UM can comprise a multiplexer the number of whose inputs is equal to the number of neighborhood registers of the block 34 and the number of whose outputs is equal to the number of data that can be processed simultaneously by the n processing units UT, each processing unit UT comprising an input/output port the number of whose connections is equal to the number of data that can be processed simultaneously by said processing unit UT, multiplied by the size of the data.
- FIG. 8 illustrates, by a set of timecharts, an exemplary management of a computation tile TC comprising two processing units UT in parallel.
- a first timechart 81 represents the clock of the processing units UT of rate F archi .
- a second timechart 82 represents the pixel clock of rate F pixel .
- the pixel clock fixes the rate of arrival of the data of the stream which are dispatched into the block 31 of shaping memories.
- the rate F archi may be equal to p × F pixel, with p a positive integer. According to FIG. 8, the rate F archi is four times the rate F pixel.
- Each processing unit UT thus has four clock cycles per datum to be processed.
- a third timechart 83 represents a shift clock.
- This clock generates two successive clock ticks 831 , 832 after one clock tick out of two of the pixel clock.
- the data of the block 34 are shifted by one column.
- a fourth timechart 84 represents the shifted pixel clock.
- the rate of this clock is substantially equal to half the rate F pixel of the pixel clock, a clock tick 840 being generated after the two clock ticks 831, 832 of the shift clock.
- the rate of the shifted pixel clock is equal to 1/n times the rate F pixel of the pixel clock.
- the data are copied from the block 31 to the block 35 . Access to the neighborhood registers by the processing units UT is possible between two clock ticks 840 of the shifted pixel clock.
- the interconnection means 4 comprise a number Nb_bus of data buses.
- Nb_bus may be defined by the following relation:
- Nb_bus = K × (F pixel / F archi) + 1.
- This embodiment makes it possible to connect the K computation tiles TC to one another by carrying out a spatiotemporal multiplexing whose temporal multiplexing ratio Mux_t is defined by the relation:
- each input/output unit UES can manage the read-access and write-access authorizations as a function of the number Nb_bus of buses and of the temporal multiplexing ratio Mux_t.
- each input/output unit UES can comprise registers making it possible to determine the time intervals during which the computation tile TC considered has a read-access or write-access authorization on one of the data buses and, for each of these time intervals, the data bus for which the read- or write-access is authorized.
- An input/output unit UES comprises for example, for the management of the write-access authorizations, Nb_bus registers of size log2(Mux_t) bits, where log2(x) is the function returning the logarithm to base 2 of the number x and, for the management of the read-access authorizations, a register of size log2(Nb_bus) bits specifying the index of the bus to be read and a register of size log2(Mux_t) bits specifying the time interval.
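- As a rough illustration of the register sizing just described, the following Python sketch computes the number and width of the authorization registers of one input/output unit UES. The function name, the dictionary keys and the rounding up to a whole number of bits are assumptions made for this example; Nb_bus and Mux_t are simply taken as inputs.

```python
def ues_register_bits(nb_bus: int, mux_t: int) -> dict:
    """Register counts and widths for one UES (illustrative sketch only)."""
    # (x - 1).bit_length() equals ceil(log2(x)) for x >= 1; rounding up
    # for values that are not powers of two is an assumption.
    slot_bits = (mux_t - 1).bit_length()
    bus_bits = (nb_bus - 1).bit_length()
    return {
        "write_registers": nb_bus,         # one write register per bus
        "write_register_bits": slot_bits,  # log2(Mux_t) bits each
        "read_bus_index_bits": bus_bits,   # log2(Nb_bus) bits
        "read_slot_bits": slot_bits,       # log2(Mux_t) bits
    }

sizes = ues_register_bits(nb_bus=2, mux_t=4)
```

- With two buses and a temporal multiplexing ratio of four, this gives two write registers of 2 bits, a bus index register of 1 bit and a time interval register of 2 bits.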
- An exemplary embodiment of such an input/output unit UES is represented in FIG. 9 .
- the input/output unit UES comprises two registers 91 and 92 of 2 bits each, the register 91 managing the write-access authorization on the bus 41 and the register 92 managing the write-access authorization on the bus 42 .
- the content of the registers 91 and 92 is compared with the value of the current time interval, for example by comparators 93 and 94 and, in the case of equality, the writing of the data is authorized on the bus 41 or 42 concerned.
- the input/output unit UES also comprises a register 95 of 1 bit specifying the index of the bus 41 or 42 to be read and a register 96 of 2 bits specifying the time interval for the reading.
- the content of the register 96 is also compared with the current time interval, for example by a comparator 97 and, in the case of equality, the reading of the data is authorized on the bus 41 or 42 concerned.
- This embodiment exhibits the advantage that each input/output unit UES individually manages the access authorizations between the computation tiles TC and the buses 41 and 42 . Consequently, no centralized control facility is necessary.
- the value of the registers of each input/output unit UES is fixed when booting the system as a function of the desired chaining of the computation tiles TC. An unused computation tile TC will be able to have the values of the registers of its input/output unit UES initialized so as to have no right to read or write on the bus 41 or 42 .
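- The decentralized access control described above can be sketched in Python as follows. This is a behavioral model under stated assumptions, not the circuit of FIG. 9: the class and attribute names are hypothetical, and the use of None to mean "no access right" is an assumption, since the description does not say how an unused tile is encoded.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class InputOutputUnit:
    # One entry per bus: the time interval in which writing is authorized.
    # None means "never" (assumed encoding for an unused bus or tile).
    write_slots: List[Optional[int]]
    read_bus: Optional[int]   # index of the bus to be read
    read_slot: Optional[int]  # time interval for the reading

    def may_write(self, bus: int, current_slot: int) -> bool:
        # Models a comparator between a write register and the current interval.
        return self.write_slots[bus] == current_slot

    def may_read(self, bus: int, current_slot: int) -> bool:
        # Models the comparator on the read interval plus the bus selection.
        return self.read_bus == bus and self.read_slot == current_slot

# Two chained tiles: the producer writes on bus 0 during interval 1,
# the consumer reads bus 0 during the same interval.
producer = InputOutputUnit(write_slots=[1, None], read_bus=None, read_slot=None)
consumer = InputOutputUnit(write_slots=[None, 2], read_bus=0, read_slot=1)

schedule = [
    (slot, bus)
    for slot in range(4)
    for bus in range(2)
    if producer.may_write(bus, slot) and consumer.may_read(bus, slot)
]
```

- Each unit decides on its own whether it may access a bus in the current time interval, so the chaining of the tiles follows only from the register values loaded at boot.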
- each computation tile TC furthermore comprises a serial block BS comprising as many data registers as processing units UT present in the tile considered, the size of the registers being of size at least equal to the size of the data of the stream.
- the serial block BS of a computation tile TC receives as input the data originating from the processing unit or units UT and is connected at output to the input/output unit UES.
- the data present in the serial block (BS) are dispatched sequentially on this bus 41 or 42 .
- FIG. 10 illustrates an exemplary implementation of the data stream processing device 1 for processing operations to be carried out on raw images.
- the raw images arise for example from a Bayer filter 110 , for example represented in FIG. 11 .
- a color image consists of a mosaic of red, green and blue colored pixels.
- the mosaic consists of an alternation of blue and green pixels on a first type of row and of an alternation of green and red pixels on a second type of row, the types of rows also being alternated so as to form diagonals of green pixels.
- the device 1 according to the invention is particularly adapted to such data. Indeed, for each type of row, it is possible to construct a computation tile TC capable of simultaneously processing several pixels although they are of different color. In one embodiment, represented in FIG.
- the computation tile TC comprises, on the one hand, a first control unit UC 1 providing a first code to a first and to a third processing unit UT 1 and UT 3 and, on the other hand, a second control unit UC 2 providing a second code to a second and to a fourth processing unit UT 2 and UT 4 .
- the first code is specific to a first color of pixel, for example red
- the second code is specific to a second color of pixel, for example green.
- the code can also be split up into code segments, an address manager then indicating the color of the processed pixel to the control units UC 1 and UC 2 .
- the processing units UT 1 , UT 2 , UT 3 and UT 4 then act on the data present in the block 34 of neighborhood registers as a function of the instructions that they receive.
- the first and third processing units UT 1 and UT 3 act on red pixels and the second and fourth processing units UT 2 and UT 4 act on the green pixels.
- the computation tile thus makes it possible to process simultaneously, but distinctly, four pixels of the block 34 of neighborhood registers.
- two control units UC 1 and UC 2 per row suffice since a row comprises only two different colors. Quite obviously, the computation tiles may be adapted as a function of the color filter applied.
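- The reason why two control units per row suffice can be checked with a short Python sketch. The phase of the mosaic (which row and column start with which color) and the helper names are assumptions chosen for illustration; only the alternation of colors and the pairing of UT 1 and UT 3 with UC 1, and of UT 2 and UT 4 with UC 2, come from the description.

```python
def bayer_color(row: int, col: int) -> str:
    """Color of a pixel in a Bayer mosaic (assumed phase: row 0 is B G B G)."""
    if row % 2 == 0:
        return "B" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "R"

def control_unit_for(ut_index: int) -> str:
    """UT 1 and UT 3 are served by UC 1, UT 2 and UT 4 by UC 2 (1-based index)."""
    return "UC1" if ut_index % 2 == 1 else "UC2"

def colors_per_control_unit(row: int, first_col: int, n: int = 4) -> dict:
    """Colors seen by each control unit when n adjacent pixels are processed."""
    seen = {"UC1": set(), "UC2": set()}
    for k in range(n):
        seen[control_unit_for(k + 1)].add(bayer_color(row, first_col + k))
    return seen

example = colors_per_control_unit(row=1, first_col=1)
```

- Whatever the row and the starting column, each control unit only ever sees one color among the four pixels processed simultaneously, which is why a single code per control unit is enough.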
- the data in transit on the interconnection means 4 can contain the datum to be processed, but also additional information, called metadata.
- metadata may be used for the transport of various information associated with the data.
- a metadatum contains for example a value representative of a noise correction or gain correction to be applied to the pixels. The same correction can thus be applied to all the pixels of the image.
- the metadata can also relate to the three values R, G and B (Red, Green and Blue), intermediate results to be associated with a pixel, or else information making it possible to control the program as a function of the characteristics of the pixel.
- the use of these metadata makes it possible to easily split the algorithms into several parts and to execute them on various computation tiles TC in multi-SIMD mode.
- the data in transit may be split according to various formats.
- the splitting format is specified in format registers M r i, as represented in FIG. 12 (four in number in this particular case M r 0 , M r 1 , M r 2 , M r 3 ).
- With the format registers M r i as represented in FIG. 12, a 24-bit data word has been divided into three parts 121 , 122 , 123 whereas a split of up to four was possible.
- These format registers are defined for a computation tile TC and their respective value is fixed when loading the program. Accesses to the data are thereafter systematically carried out by a mechanism such as presented in FIG. 13 , composed of multiplexers 131 a , 131 b , 131 c , 131 d , of shift registers 132 a , 132 b and of logical gates 133 a , 133 b , 133 c , 133 d , 137 .
- Depicted in this FIG. 13 are the format registers M r i 134 associated with position registers D r i 135 . These registers D r i 135 are deduced from the registers M r i 134 by the software for loading the parameters of the program. They make it possible to obtain the start position of the metadatum considered.
- the format registers 134 are linked by a multiplexer 131 b controlled by the current position to be recovered CMr to a logical AND cell 133 d making it possible to set to zero the bits of the register 136 for which the position defined is not relevant, and then the position registers 135 are linked to a shift register 132 b which makes it possible to shift the previous result the appropriate number of times so as to have a right-shifted value which is the final value to be recovered for the position considered.
- the datum to be written VaI is shifted to the appropriate position by virtue of the position register 135 linked to a multiplexer 131 c controlled by the current position to be written CMw, which multiplexer 131 c is itself linked to a shift register 132 a . Only the bits concerned are not set to zero by virtue of a logical AND cell 133 a linked to the multiplexer 131 a which gives the bits to be masked by virtue of the format register 134 . Finally these bits are concatenated with those already present in the destination register 136 .
- the format register 134 is inverted by a logical NOT cell 137 and then a logical AND is carried out by a cell 133 b between the inverted format register and the value of the register 136 , thus making it possible to fabricate a new mask which drives a logical OR cell 133 c making it possible to associate the new datum with the register 136 without touching the bits that are not relevant.
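- The read and write paths described above amount to mask-and-shift operations, which may be sketched in Python as follows. The function names and the example masks are hypothetical; in particular, the actual split of the 24-bit word in FIG. 12 is not reproduced here. The position is deduced from the mask as the index of its lowest set bit, which is one plausible reading of how the position registers are derived from the format registers.

```python
WORD_BITS = 24
WORD_MASK = (1 << WORD_BITS) - 1

def position_from_mask(mask: int) -> int:
    """Start position of a field: index of the lowest set bit of a nonzero mask."""
    if mask == 0:
        raise ValueError("an empty format register describes no field")
    return (mask & -mask).bit_length() - 1

def read_field(word: int, mask: int) -> int:
    """Logical AND with the format mask, then right shift to the field origin."""
    return (word & mask) >> position_from_mask(mask)

def write_field(word: int, mask: int, value: int) -> int:
    """Shift the value into place, mask it, and merge it with the other bits."""
    shifted = (value << position_from_mask(mask)) & mask  # shift + AND
    kept = word & ~mask & WORD_MASK                       # NOT + AND
    return kept | shifted                                 # OR

# Hypothetical split of a 24-bit word into three fields.
MASK_HIGH, MASK_MID, MASK_LOW = 0xFFF000, 0x000F00, 0x0000FF

word = write_field(0xFFFFFF, MASK_MID, 0x3)
```

- Writing a field leaves the bits of the other fields untouched, and reading returns a right-aligned value regardless of where the field sits in the word.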
- the processing device according to the invention can thus be advantageously used for the processing operations in the Fourier domain (frequency domain) for example.
- the processing device according to the invention can also be adapted for accelerating the emulation of floating-point numbers with the aid of fixed-point operators.
- the metadata may be used to determine the instructions to be transmitted to the processors.
- the complementary information can indicate the specific processing to be executed on the data with which they are associated. It suffices to extract the necessary information from the data word as soon as it enters the computation tile TC, illustrated by the example of FIG. 14 , and to transmit it to the control unit UC which manages the processing units UT in multi-SIMD mode.
- the metadatum may be extracted when the neighborhood manager organizes the data so as to transmit them to the processing units UT via multiplexers 141 .
- An additional communication 142 between the input/output unit UES and the control unit UC allows the transfer of the metadatum, as represented in FIG. 15 .
- the interconnection means 4 may be adapted in terms of capacity.
- the size of the data buses 41 , 42 may be increased as a function of the size of the metadata.
- the device 1 according to the invention can comprise an insertion operator making it possible to concatenate each datum of the stream with a metadatum.
- FIG. 15 represents such an insertion operator 150 .
- the insertion operator 150 comprises an input bus 151 linked to an input of an insertion block 152 whose output is linked to an output bus 153 .
- the insertion operator 150 can also comprise a memory 154 making it possible to store the metadatum.
- the memory 154 is linked with the insertion block 152 so as to allow the transfer of the metadatum.
- the size of this datum must be less than or equal to the difference between the maximum size of the data that can be transferred by the interconnection means 4 and the size of the data of the stream.
- the size of the input bus 151 must be adapted to the size of the data of the stream whereas the size of the output bus 153 must be adapted to the size of the data of the stream that are concatenated with the metadatum.
- the insertion operator 150 may be inserted on one of the data buses 41 , 42 , for example between the video sensor 2 and the computation tiles TC or between two computation tiles TC. In one embodiment, the insertion operator 150 is embodied by a computation tile TC.
- the computation tile TC then comprises a storage unit UM containing the complementary datum and a processing unit UT making it possible to concatenate the data of the stream with the complementary datum.
- the complementary datum is for example stored in a data register of the storage unit UM.
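- A minimal Python sketch of the concatenation performed by the insertion operator is given below, together with the size constraint on the metadatum. The function name and parameters are hypothetical, and placing the metadatum in the most significant bits above the datum is an assumption; the description only requires that both fit within the interconnection means.

```python
def insert_metadatum(datum: int, metadatum: int,
                     data_bits: int, meta_bits: int, bus_bits: int) -> int:
    """Concatenate a stream datum with a metadatum (illustrative sketch)."""
    # Size constraint: metadatum size <= bus capacity - stream data size.
    if meta_bits > bus_bits - data_bits:
        raise ValueError("metadatum too large for the interconnection means")
    if not 0 <= datum < (1 << data_bits):
        raise ValueError("datum does not fit in data_bits")
    if not 0 <= metadatum < (1 << meta_bits):
        raise ValueError("metadatum does not fit in meta_bits")
    # Assumed layout: metadatum above the datum.
    return (metadatum << data_bits) | datum

combined = insert_metadatum(0x3FF, 0x5, data_bits=10, meta_bits=4, bus_bits=24)
```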
- the VLIW allows work on two pathways:
- a signed shift operator makes it possible to shift the values to the right or left according to the sign of the shift
- a shift of 0 is equivalent to an assignment
- This code carries out the following operation: for a 2×2 neighborhood, it sets R 1 to the average of the pixels of the neighborhood and sets R 0 to 255 if the value of the average is >128; it increments R 2 if the pixel is at 255 (so as to have the count of the pixels >128 at the end of the processing).
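- The operation just described can be restated in Python for reference. This is not the annex code itself but a behavioral sketch: the function name is hypothetical, the average is assumed to be an integer sum shifted right by two bits, and R 0 is assumed to be 0 when the threshold is not exceeded, which the description does not state.

```python
def threshold_2x2(image):
    """Apply the 2x2 average-and-threshold operation to a list-of-rows image.

    Returns the thresholded output and the final count (register R2).
    """
    r2 = 0
    output = []
    for r in range(len(image) - 1):
        out_row = []
        for c in range(len(image[0]) - 1):
            total = (image[r][c] + image[r][c + 1]
                     + image[r + 1][c] + image[r + 1][c + 1])
            r1 = total >> 2               # average via a right shift (assumption)
            r0 = 255 if r1 > 128 else 0   # 0 otherwise is an assumption
            if r0 == 255:
                r2 += 1                   # count of pixels above the threshold
            out_row.append(r0)
        output.append(out_row)
    return output, r2

result, count = threshold_2x2([[200, 200, 0],
                               [200, 200, 0],
                               [0,   0,   0]])
```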
Description
- The invention relates to a device for processing a data stream. It lies in the field of computation architectures and finds particular utility in embedded applications of multimedia type integrating a video sensor. It involves notably mobile telephony, mobile multimedia readers, photographic apparatus and digital camcorders. The invention also finds utility in applications relating to telecommunications and, more generally, in any signal processing chain for processing digital data at high rate.
- Signal processing in general, and processing of images in particular, require significant computational powers, especially over the last few years with the rapid increase in the resolution of image sensors. In the field of embedded applications aimed at the general public, heavy constraints in terms of fabrication cost are added to the constraints of electrical consumption (of the order of a few hundred milliwatts). To respond to these constraints, image processing is commonly carried out on the basis of dedicated computation modules operating in data flow mode. The “data flow” mode, as it is commonly known in the literature, is a data processing mode according to which the data entering the computation module are processed as and when they arrive, at the rate of their arrival, a result being provided as output from the computation module at the same rate, optionally after a latency time. Dedicated computation modules make it possible to comply with the fabrication cost constraints on account of their small silicon area and the performance constraints, notably as regards computational power and electrical consumption. However, such modules suffer from a flexibility problem, it not being possible for the processing operations supported to be modified after the construction of the modules. At the very best, these modules are parametrizable. Stated otherwise, a certain number of processing-related parameters may be modified after construction.
- A solution to this lack of flexibility consists in using completely programmable processors. The processors most commonly used are signal processors, well known in the literature under the acronym “DSP” for “Digital Signal Processor”. Drawbacks of these processors are their significant silicon footprint and their electrical consumption, often rendering them ill-adapted to highly constrained embedded applications.
- Compromises between dedicated computation modules and completely programmable processors are currently under development. According to a first compromise, a circuit comprises a data processing unit having very long instruction words, called a VLIW (“Very Long Instruction Word”) unit, and a unit making it possible to execute an instruction on several computation units, called an SIMD (“Single Instruction Multiple Data”) unit. In certain current constructions, computation units of VLIW and/or SIMD type are implanted in the circuit as a function of the necessary computational power. The choice of the type of unit to be included in the circuit, of their number and of the way they are chained together is decided before the construction of the circuit by analyzing the application code and necessary resources. The order in which the units are chained together is fixed and it does not make it possible to subsequently change the chaining of the processing operations. Moreover, the units are globally fairly complex since the control code for the application is not separate from the processing code. Thus, the processing operators of these units are of significant size, thereby leading to an architecture whose silicon area and electrical consumption are more significant for equal computational power.
- According to a second compromise, a C-language code may be transformed into a set of elementary instructions by a specific compiler. The set of instructions is then implanted on a configurable matrix of predefined operators. This technology may be compared with that of FPGA, which is the acronym for “Field Programmable Gate Array”, the computation grain being bigger. It does not therefore make it possible to obtain programmable circuits, but only circuits that can be configured by code compilation. If it is desired to integrate parts of program code that are not provided for at the outset, computation resources which are not present in the circuit are then necessary. It therefore becomes difficult or indeed impossible to implement this code.
- According to a third compromise, the data are processed by a so-called parallel architecture. Such an architecture comprises several computation tiles linked together by an interconnection bus. Each computation tile comprises a storage unit making it possible to store the data locally, a control unit providing instructions for carrying out processing on the stored data, processing units carrying out the instructions received from the control unit on the stored data and an input/output unit conveying the data either between the interconnection bus and the storage unit, or between the processing units and the interconnection bus. This architecture presents several advantages. A first advantage is the possibility of modifying the code to be executed by the processing units, even after the construction of the architecture. Furthermore, the code to be executed by the processing units generally comprises only computation instructions but no control or address computation instruction. A second advantage is the possibility of carrying out in parallel, either an identical processing on several data, or more complex processing operations for one and the same number of clock cycles by profiting from the parallel placement of the processing units. A third advantage is that the computation tiles may be chained together according to the processing operations to be carried out on the data, the interconnection bus conveying the data between the computation tiles in a configurable order. Moreover, the parallel architecture may be extended by adding further computation tiles, so as to adapt its processing capabilities to the processing operations to be carried out. However, the management of the data in the computation tiles is complex and generally requires significant memory resources. 
In particular, when a computation tile is performing a processing on a data neighborhood, all the data of this neighborhood must be available to it simultaneously, whereas the data arrive in the form of a continuous stream. The storage unit of the computation tile must then store a significant part of the data of the stream before being able to perform a processing on a neighborhood. This storage and the management of the stored data require optimization so as to limit the silicon area and the electrical consumption of the parallel architecture while offering computational performance adapted to the processing of a data flow.
- An aim of the invention is to propose a computation structure which is programmable and adapted to the processing of a data stream, notably when processing operations must be carried out on neighborhoods of data. For this purpose, the subject of the invention is a device for processing a data stream originating from a device generating matrices of NI rows by Nc columns of data. The processing device comprises K computation tiles and interconnection means for transferring the data stream between the computation tiles. At least one computation tile comprises:
-
- one or more control units making it possible to provide instructions,
- n processing units, each processing unit carrying out the instructions received from a control unit on a neighborhood of VI rows by Vc columns of data,
- a storage unit making it possible to place the data of the stream in the form of neighborhoods of VI rows by (n+Vc−1) columns of data, the storage unit comprising a block of shaping memories of dimension VI×Nc and a block of neighborhood registers of dimension VI×(n+Vc−1),
- an input/output unit making it possible to convey the data stream between the interconnection means and the storage unit on the one hand, and between the processing units and the interconnection means on the other hand.
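- The dimensions of the two memory blocks, and the way n processing units can share one block of neighborhood registers, may be illustrated by the following Python sketch. The function names are hypothetical, and the assumption that the processing unit of index k reads columns k to k+Vc−1 is one natural reading of the dimension VI×(n+Vc−1), not a statement of the actual wiring.

```python
def storage_dimensions(vl: int, vc: int, nc: int, n: int) -> dict:
    """Dimensions (rows, columns) of the two blocks of a storage unit."""
    return {
        "shaping_memories": (vl, nc),
        "neighborhood_registers": (vl, n + vc - 1),
    }

def unit_windows(registers, vc: int, n: int):
    """Split the neighborhood registers into one vl x vc window per unit.

    Assumption: unit k sees columns k .. k + vc - 1, so adjacent units
    share vc - 1 columns.
    """
    return [[row[k:k + vc] for row in registers] for k in range(n)]

dims = storage_dimensions(vl=3, vc=3, nc=640, n=2)
registers = [[10 * r + c for c in range(4)] for r in range(3)]  # 3 x (2+3-1)
windows = unit_windows(registers, vc=3, n=2)
```

- For two processing units and 3×3 neighborhoods, four register columns are enough because the two neighborhoods overlap on two columns.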
- An advantage of the invention is that the storage unit of a computation tile in which a processing on a neighborhood of data is carried out is particularly adapted to such processing, notably in terms of dimensioning of the memory registers and management of accesses to the memory registers by the processing units.
- The invention will be better understood and other advantages will become apparent on reading the detailed description of an embodiment given by way of example, this description being offered in relation to appended drawings which represent:
-
- FIG. 1, an exemplary device for processing a data stream according to the invention,
- FIG. 2, an exemplary processing unit comprising a very long instruction word processor,
- FIG. 3, an exemplary management of a block of shaping memories,
- FIG. 4, an exemplary management of a block of neighborhood registers in a case where the data of the block of shaping memories are in order,
- FIG. 5, an exemplary management of the block of neighborhood registers in the case where the data of the block of shaping memories are not in order,
- FIG. 6, a set of timecharts illustrating the temporal management of a block of neighborhood registers,
- FIG. 7, an exemplary embodiment of a computation tile comprising several processing units in parallel,
- FIG. 8, a set of timecharts illustrating the temporal management of a storage unit of a computation tile comprising two processing units in parallel,
- FIG. 9, an exemplary embodiment of an input/output unit,
- FIG. 10, an exemplary implementation of the device according to the invention for video images,
- FIG. 11, a schematic representation of a Bayer filter,
- FIG. 12, an example of format registers allowing the splitting of the data of the stream,
- FIG. 13, an exemplary mechanism allowing access to a register containing metadata,
- FIG. 14, an exemplary embodiment of a computation tile comprising several processing units, the processing units receiving specific instructions as a function of metadata,
- FIG. 15, an exemplary embodiment of an insertion operator.
- The subsequent description is given in relation to a processing chain for a video data stream originating from a video sensor such as a CMOS sensor. The processing chain makes it possible for example to reconstruct color images on the basis of a monochrome video sensor to which is applied a color filter, for example a Bayer filter, to improve the quality of the images restored, or else to carry out morphological operations such as erosion/dilation or the low-level part in the processing of pixels of advanced applications such as image stabilization, red eye correction or the detection of faces. However, the device according to the invention may be equally suitable for processing a stream of data other than those arising from a video sensor. The device can for example process an audio data stream or data in Fourier space. Generally, the device exhibits particular interest for the processing of data which, although being conveyed in the form of a stream, possess a coherence in a two-dimensional space.
- FIG. 1 schematically represents a device 1 for processing a data stream according to the invention. A video sensor 2 generates a digital data stream directed toward the processing device 1, by way of a data bus 3. The data arising from the video sensor 2 are referred to as raw data. The device 1 processes these raw data so as to generate as output data referred to as final data. To this end, the device 1 according to the invention comprises processing units UT, control units UC, storage units UM and input/output units UES grouped into K computation tiles TC. The device 1 also comprises interconnection means 4 such as data buses.
- These interconnection means 4 make it possible to transfer the data stream between the various computation tiles TC. Each computation tile TC comprises a storage unit UM, one or more control units UC, at least one processing unit UT per control unit UC and an input/output unit UES. The storage units UM make it possible to shape the data of the stream so that they can be processed by the processing units UT as a function of code instructions delivered by the control units UC. The input/output units UES make it possible to convey the data stream between the interconnection means 4 and the storage units UM on the one hand, and between the processing units UT and the interconnection means 4 on the other hand. In the example of FIG. 1, the device 1 comprises four computation tiles TC, the first and the fourth computation tile TC1 and TC4 each comprising a storage unit UM, a control unit UC, a processing unit UT and an input/output unit UES, the second computation tile TC2 comprising a storage unit UM, a control unit UC, two processing units UT and an input/output unit UES, and the third computation tile TC3 comprising a storage unit UM, two control units UC, two processing units UT per control unit UC and an input/output unit UES. Each computation tile TC makes it possible to carry out a function or a series of functions on the basis of code instructions. Within the framework of a video processing chain, each computation tile TC carries out for example one of the following functions: correction of the white balance, dematrixing, noise reduction, contour accentuation. The composition of a computation tile TC depends notably on the function or functions that it has to carry out. In particular, the number of control units UC making up a computation tile TC depends on the number of different processing operations having to be carried out simultaneously by the computation tile TC. Since each control unit UC within the computation tile TC is able to comprise its own code, a computation tile TC comprises for example as many control units UC as distinct processing operations to be carried out in parallel on the data.
- The processing units UT may be more or less complex. In particular, they can comprise either simple dedicated operators, for example composed of logic blocks, or processors. Each processing unit UT is independent of the others and can comprise different operators or processors. The dedicated operators are for example multipliers, adders/subtracters, assignment operators or shift operators. Advantageously, the processing units UT contain only the dedicated operators commonly used for the processing envisaged.
- A processing unit UT can also comprise a processor. In a first embodiment, the processor comprises a single arithmetic and logic unit. In a second embodiment, the processor is a very long instruction word (VLIW) processor. Such a processor can comprise several arithmetic and logic units. In a preferred variant, a VLIW processor comprises for example instruction decoders, no arithmetic and logic units but only computation operators, a local memory and data registers. Advantageously, only the computation operators necessary for the execution of the computation codes to be carried out are implanted in the processor during its design. Thereafter, two or more of them may be used in the same cycle to perform distinct operations in parallel. The unused operators do not receive the clock signals. The electrical consumption of the processing units UT is thereby reduced. These advantageous characteristics have led to a particular embodiment, represented in FIG. 2. In this figure, the VLIW processor comprises two pathways. Stated otherwise, it can execute up to two instructions in one and the same clock cycle. The processor comprises a first instruction decoder 21, a second instruction decoder 22, a first set of multiplexers 23, a set of computation operators 24, a second set of multiplexers 25, a set of data registers 26 and a local memory 27. The multiplexers 23 direct data to be processed onto an input of one of the computation operators 24 and the multiplexers 25 direct the processed data to the data registers 26. The data registers 26 containing the processed data may be linked up with outputs of the processor. The size of the very long instruction words is for example 48 bits, i.e. 24 bits per pathway. The computation operators 24 thus work in 24-bit precision. Within the framework of video processing and more particularly of image reconstruction on the basis of data arising from a video sensor, the computation operators 24 are advantageously two adders/subtracters, a multiplier, an assignment operator, an operator for writing to the local memory and a shift operator.
- Still according to a particular embodiment, the execution of the instructions may be conditioned by setting a flag. The instruction can then be supplemented with a prefix indicating the execution condition. The flag is for example a bit of a register containing the result of an instruction executed during the previous clock cycle. This bit can correspond to the zero, sign or carry indicators of the register.
- According to a particular embodiment, each instruction word is coded on 24 bits. The first 3 bits (bits 0 to 2) can contain the instruction condition, the following two bits (bits 3 and 4) can code the mode of access to the datum, the sixth, seventh and eighth bits (bits 5 to 7) can code the identifier of the operation, the following four bits (bits 8 to 11) can designate the destination register, the following four bits (bits 12 to 15) can designate the source register and the last 8 bits (bits 16 to 23) can contain a constant. An exemplary programming using such a coding is given in the annex.
- The device 1 for processing a data stream comprises M control units UC, M lying between 1 and N, N being the number of processing units UT. In the case where the number M of control units UC is equal to the number N of processing units UT, each processing unit UT can have its own control unit UC. In the case where the number M of control units UC is less than the number N of processing units UT, then at least one computation tile TC comprises several processing units UT, as in the example of FIG. 1 (TC2, TC3). A control unit UC of this computation tile TC then provides instructions to several processing units UT, these processing units UT being said to be in parallel. A control unit UC can comprise a memory making it possible to store the code instructions for the processing unit or units UT that it serves. A control unit UC can also comprise an ordinal counter, an instruction decoder and an address manager.
- Within the framework of a processing of raw images obtained by a color filter, the address manager and the ordinal counter make it possible to apply a different processing as a function of the color of the current pixel. In particular, the code may be split up into code segments, each code segment comprising instructions for one of the colors of the filter. The address manager can indicate to the ordinal counter the color of the current pixel, for example red, green or blue. According to a particular embodiment, the address manager comprises a two-bit word making it possible to code up to four different colors or natures of pixels in a pixel neighborhood of size two by two. At each clock cycle, the ordinal counter is incremented by a shift value (offset) depending on the value of the word. The ordinal counter then makes it possible to point at the code segment corresponding to the color of the current pixel. The four shift values are determined during compilation of the code as a function of the number of instructions of each of the code segments.
The use of an address manager and of an ordinal counter relieves the programmer of having to determine the nature of the current pixel within the program. This management becomes automatic, which shortens the execution time and simplifies the programming. In the particular case where the processed images are monochrome, the same instructions are applied to all the pixels. The shift values are then equal and are determined so that the ordinal counter points at the first instruction after the initialization code.
- The
device 1 for processing a data stream also comprises K storage units UM, K lying between 1 and M. A computation tile TC can comprise several control units UC, as in the example ofFIG. 1 (TC3). The same data of the stream, or neighboring data, that are present in the storage unit UM can then be processed differently by the processing units UT of the computation tile, each control unit UC providing instructions to at least one processing unit UT. The main function of the storage units UM is to shape the data of the stream so as to facilitate access to these data by the processing units UT. - According to a first embodiment, a storage unit UM comprises an equal number of data registers to the number of processing units UT situated in the computation tile TC of the storage unit UM considered.
- According to a second embodiment, particularly adapted to the processing of video images, a storage unit UM shapes the data in the form of neighborhoods and manages access to the data when processing units UT are in parallel. Such a storage unit UM can comprise a first memory block called a block of shaping memories and a second memory block called a block of neighborhood registers. Since the storage units UM of the various computation tiles TC are independent of one another, the
device 1 for processing the data stream can comprise at one and the same time storage units UM according to the first embodiment and storage units UM according to the second embodiment. The second embodiment makes it possible to carry out processing operations on data neighborhoods. For a video image, a neighborhood may be defined as a mesh of adjacent pixels, this mesh generally being square or at least rectangular. A rectangular mesh may be defined by its dimension VI×Vc where VI is the number of pixels of the neighborhood row-wise and Vc is the number of pixels of the neighborhood column-wise. The block of shaping memories stores the data of the stream so that they can be copied in a systematic manner with each arrival of a new datum. The block of neighborhood registers allows access to the pixels of the current neighborhood by the processing unit or units UT of the computation tile considered. -
FIG. 3 illustrates, by a block 31 of shaping memories represented at different time steps T, an exemplary management of the block 31 for data corresponding to a stream of values of pixels originating from a device generating matrices of NI rows by Nc columns of data, such as a video sensor 32. The video sensor 32 has resolution Nc columns by NI rows of pixels. The resolution is for example VGA (640×480), “HD Ready” (1280×720) or “Full HD” (1920×1080). The pixels are dispatched to the block 31 of shaping memories and stored there as they arrive. This block 31 is advantageously of dimension VI×Nc so as to make it possible to generate neighborhoods of dimension VI×Vc. Stated otherwise, the block 31 comprises VI×Nc memory cells arranged according to a mesh of VI rows and Nc columns. Usual values for VI are three, four, five, six or seven. Physically, the block 31 can consist of one or more memory modules. The block 31 may be managed as a shift register. Stated otherwise, at each time step or clock cycle, the data are shifted so as to leave room for the new incoming datum. Advantageously, the block 31 is managed as a conventional memory in such a way that the pixels are copied in their order of arrival.
- In the latter case and in a first embodiment, a counter CPT that is incremented on each incoming datum is considered. Each new pixel coming from the data stream is then copied into a cell 33 of the block 31 of shaping memories situated in the row corresponding to E(CPT/Nc), where E(x) is the function returning the integer part of a number x, and in the column corresponding to the remainder of CPT/Nc. The counter CPT is reset to zero each time it reaches the value equal to VI×Nc.
- In a second embodiment, a counter CPTC that is incremented after each incoming datum and a counter CPTL that is incremented each time the counter CPTC reaches the value Nc are considered. The counter CPTC is reset to zero each time it reaches the value Nc and the counter CPTL is reset to zero each time it reaches the value VI. Each new pixel coming from the data stream is then copied into the cell 33 whose row index corresponds to the value CPTL and whose column index corresponds to the value CPTC.
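The two addressing schemes described above can be sketched as follows, purely as an illustration. The function names are hypothetical; each function returns the (row, column) cell of the block 31 into which successive incoming pixels would be copied, with 0-based indices.

```python
def single_counter_cells(vi, nc, count):
    """First embodiment: one counter CPT, wrapped at VI x Nc."""
    cells, cpt = [], 0
    for _ in range(count):
        # row = E(CPT / Nc), column = remainder of CPT / Nc
        cells.append(divmod(cpt, nc))
        cpt += 1
        if cpt == vi * nc:
            cpt = 0
    return cells


def two_counter_cells(vi, nc, count):
    """Second embodiment: column counter CPTC and row counter CPTL."""
    cells, cptl, cptc = [], 0, 0
    for _ in range(count):
        cells.append((cptl, cptc))
        cptc += 1
        if cptc == nc:      # end of a row of the block 31
            cptc = 0
            cptl += 1
            if cptl == vi:  # cyclic reuse of the first row
                cptl = 0
    return cells
```

Both schemes visit the cells in the same order; the second avoids the division and the remainder computation of the first.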
FIG. 4 illustrates an exemplary management of the block of neighborhood registers for data originating from the block 31 of shaping memories. The block 34 of neighborhood registers comprises for example a number of neighborhood registers equal to VI×Vc. These neighborhood registers are arranged in the same manner as the neighborhood of pixels, that is to say they form a mesh of VI rows and Vc columns of registers. The copying of the data of the block 31 of shaping memories to the neighborhood registers starts as soon as there is a number of data in the block 31 equal to (VI−1)×Nc+1. In the case of a neighborhood of dimension 3×3, represented in FIG. 4, the copying of the data thus starts when two rows of data plus one datum are present in the block 31. In one embodiment, the data are copied at each clock cycle in groups of VI data of one and the same column. At a given time step, the index of the column to be copied is given by the value of CPTC. This column in fact comprises the last pixel that arrived in the block 31. Advantageously, a column 35 of VI data registers is added to the neighborhood registers. This column 35 makes it possible to disable accesses to the registers of the block 34 by the processing units UT only during a single clock cycle, that of the shifting of the values in the block 34. Otherwise, accesses are disabled at one and the same time during the shifting of the values and during the copying of the data from the block 31. During a first clock cycle, the data of the column, indicated by the counter CPTC, of the block 31 are copied into the registers of the column 35. During a second clock cycle, all the data of the block 34 and of the column 35 are shifted by one column. Thus, for a neighborhood of dimension 3×3, in one and the same clock cycle, the data of a first column 341 are shifted toward a second column 342, while the data of this column 342 are shifted toward a third column 343 and the data of the column 35 are shifted toward the column 341. 
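As a minimal sketch of the two-cycle scheme described above, the block 34 can be modeled as a list of columns ordered 341, 342, 343, with the column 35 held separately. The list representation and the function names are assumptions of this sketch, not part of the described hardware.

```python
def copy_start_count(vi, nc):
    """Number of data that must be present in block 31 before copying starts."""
    return (vi - 1) * nc + 1


def shift_neighborhood(block34, column35):
    """One shift cycle: column 35 enters as the first column, the last column drops out.

    block34 is a list of columns, each column a list of VI values.
    A new list is returned; the inputs are left unchanged.
    """
    return [list(column35)] + [list(col) for col in block34[:-1]]
```

For a 3×3 neighborhood and a VGA row length of 640 pixels, `copy_start_count(3, 640)` gives 1281, that is two full rows plus one datum.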
- On account of the cyclic management of the block 31, the data are not always stored in the block 31 according to the order of the rows of the video sensor 32. In this case, the pixels must be copied into the column 35 or, if appropriate, into the column 341 of the block 34, in a different order. FIG. 5 illustrates such a case where the last data of the stream are stored in the first row of the block 31. In the case of a neighborhood of dimension 3×3, the copying of the pixels into column 35 may be managed by the following placement steps:
- the last arriving pixel always goes in the third row 347 of the column 35 of the neighborhood registers;
- if the counter CPTL is equal to zero, stated otherwise if the last pixel has arrived at the first row 311 of the block 31, then
  - the pixel of the second row 312 of the block 31 is copied to the first row 345 of the column 35,
  - the pixel of the third row 313 of the block 31 is copied to the second row 346 of the column 35;
- if the counter CPTL is equal to one, stated otherwise if the last pixel has arrived at the second row 312 of the block 31, then
  - the pixel of the first row of the block 31 is copied to the second row 346 of the column 35,
  - the pixel of the third row 313 of the block 31 is copied to the first row 345 of the column 35;
- if the counter CPTL is equal to two, stated otherwise if the last pixel has arrived at the third row 313 of the block 31, then
  - the pixel of the first row 311 of the block 31 is copied to the first row 345 of the column 35,
  - the pixel of the second row 312 of the block 31 is copied to the second row 346 of the column 35.
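The placement steps above can be sketched with a single modular index. This is one possible reading, given with 0-based row indices: the row of the column 35 numbered `dest` receives the pixel of the block 31 stored in row `(CPTL + dest + 1) mod VI`. The function name and the list representation are assumptions; the rule reproduces the three 3×3 cases listed above.

```python
def fill_column35(block_column, cptl):
    """Reorder one column of block 31 into sensor row order for column 35.

    block_column: the VI pixels of the column indicated by CPTC, indexed by
                  their row in block 31 (0-based).
    cptl:         row of block 31 that received the last pixel (0-based).
    Returns the VI values of column 35, first row to last row.
    """
    vi = len(block_column)
    return [block_column[(cptl + dest + 1) % vi] for dest in range(vi)]
```

The last row of the result is always the row indexed by CPTL, in agreement with the first placement step.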
- More generally, in the case of a neighborhood of size VI×Vc, the pixel of the block 31 of shaping memories which is situated in the row RowNo and in the column indicated by CPTC is notably copied to column 35, or, if appropriate, into the first column 341 of the block 34, in the row defined by (CPTL+RowNo+1) modulo VI. RowNo takes all the positive integer values lying between 1 and VI so as to allow the copying of the pixels for all the rows of the neighborhood.
- According to a particular embodiment, the copying of the pixels of the
block 31 into thecolumn 35 of registers is not performed simultaneously with the shifting of the pixels in theblock 34. This embodiment allows the processing units UT to access the data present in theblock 34 of neighborhood registers during a longer period.FIG. 6 represents a set of timecharts making it possible to implement this embodiment. The temporal shift between the copying of the pixels and the shifting of the pixels into theblock 34 may be effected by introducing, in addition to a first clock, called thepixel clock 61 and making it possible to regulate the data stream and the copying of the pixels, a second clock, called the shiftedpixel clock 62. This shiftedpixel clock 62 may be at the same frequency as thepixel clock 61 but shifted in time. This shift corresponds for example to a period of the clock of theprocessing units UT 63. The data present in theblock 34 are then accessible throughout the period separating two clock ticks of the shiftedpixel clock 62. Access to the neighborhood registers by the processing units UT may be effected through an input/output port, for example integrated into each processing unit UT, the number of whose connections is equal to the number of neighborhood registers, multiplied by the size of the data. Each neighborhood register is linked to the input/output port. Advantageously, each storage unit UM comprises a multiplexer the number of whose inputs is equal to the number of neighborhood registers of theblock 34 and the number of whose outputs is equal to the number of data that can be processed simultaneously by the processing unit UT of the computation tile TC considered. The processing unit UT can then comprise an input/output port the number of whose connections is equal to the number of data that can be processed simultaneously, multiplied by the size of the data. 
In this instance, a processing unit UT comprising a VLIW processor with two pathways processing data on 12 bits can comprise an input/output port with 24 (2×12) connections. - According to a particular embodiment, one and the same storage unit UM provides data to several processing units UT in parallel. Stated otherwise, the
processing device 1 comprises a computation tile TC comprising several processing units UT. This embodiment advantageously uses the storage units UM comprising ablock 31 of shaping memories and ablock 34 of neighborhood registers. However, the dimension of theblock 34 of neighborhood registers has to be adapted.FIG. 7 illustrates an exemplary computation tile TC where a storage unit UM provides data to n processing units UT in parallel, n being less than or equal to the number N of processing units UT of thedevice 1. The instructions are provided to the n processing units UT by one or more control units UC. According to this embodiment, theblock 34 of neighborhood registers is of dimension VI×(n+Vc−1). Stated otherwise, theblock 34 comprises VI×(n+Vc−1) data registers arranged according to a mesh of VI rows and n+Vc−1 columns. For example, for three processing units UT in parallel and a neighborhood ofdimension 5×5, a mesh of 7 (=3+5−1) columns and 5 rows of registers are necessary. Moreover, acolumn 35 of VI data registers may be added to theblock 34. Thus, access to the neighborhood registers by the processing units UT is disabled only during a single cycle of the processing units UT. The copying of the data of theblock 31 to thecolumn 35 of registers then starts when theblock 31 of shaping memories comprises (VI−1)×Nc+1 data. Moreover, for n processing units UT in parallel, the processing of the data is carried out when n new data have arrived in theblock 31. Access to the neighborhood registers by the n processing units UT can also be carried out through an input/output port integrated into each processing unit UT. The number of connections of the input/output port of each processing unit UT is then equal to the number of neighborhood registers to which the processing unit UT requires access, multiplied by the size of the data. 
Likewise, the storage unit UM can comprise a multiplexer the number of whose inputs is equal to the number of neighborhood registers of theblock 34 and the number of whose outputs is equal to the number of data that can be processed simultaneously by the n processing units UT, each processing unit UT comprising an input/output port the number of whose connections is equal to the number of data that can be processed simultaneously by said processing unit UT, multiplied by the size of the data. -
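The sizing rules given above for a tile with n processing units in parallel can be checked with a few lines of arithmetic. The function names are hypothetical, and `unit_window` rests on an assumption not stated in the text, namely that the unit of index k reads the Vc consecutive columns starting at column k.

```python
def neighborhood_columns(n_units, vc):
    """Columns of block 34 needed to serve n units with Vc-wide neighborhoods."""
    return n_units + vc - 1


def register_count(vi, vc, n_units):
    """Total neighborhood registers: VI rows by (n + Vc - 1) columns."""
    return vi * neighborhood_columns(n_units, vc)


def port_connections(values_per_cycle, data_bits):
    """Connections of an input/output port fed through the multiplexer."""
    return values_per_cycle * data_bits


def unit_window(k, vc):
    """Columns of block 34 read by unit k (assumed layout, 0-based)."""
    return list(range(k, k + vc))
```

For three units and a 5×5 neighborhood this gives 7 columns and 35 registers; a two-pathway processor on 12-bit data needs a port of 24 connections.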
FIG. 8 illustrates, by a set of timecharts, an exemplary management of a computation tile TC comprising two processing units UT in parallel. A first timechart 81 represents the clock of the processing units UT of rate Farchi. A second timechart 82 represents the pixel clock of rate Fpixel. The pixel clock fixes the rate of arrival of the data of the stream which are dispatched into the block 31 of shaping memories. The rate Farchi may be equal to p×Fpixel with p a positive integer. According to FIG. 8, the rate Farchi is four times greater than the rate Fpixel. Each processing unit UT thus has four clock cycles per datum to be processed. A third timechart 83 represents a shift clock. This clock generates two successive clock ticks 831, 832 after one clock tick out of two of the pixel clock. At each clock tick of the shift clock, the data of the block 34 are shifted by one column. A fourth timechart 84 represents the shifted pixel clock. The rate of this clock is substantially equal to half the rate of the pixel clock, a clock tick 840 being generated after the two clock ticks 831, 832 of the shift clock. Generally, the rate of the shifted pixel clock is equal to 1/n times the rate Fpixel of the pixel clock. At each clock tick 840 of the shifted pixel clock, the data are copied from the block 31 to the column 35. Access to the neighborhood registers by the processing units UT is possible between two clock ticks 840 of the shifted pixel clock.
- According to a particular embodiment, the interconnection means 4 comprise a number Nb_bus of data buses. Nb_bus may be defined by the following relation:
Nb_bus = K × (Fpixel / Farchi) + 1.
- This embodiment makes it possible to connect the K computation tiles TC to one another by carrying out a spatiotemporal multiplexing whose temporal multiplexing ratio Mux_t is defined by the relation:
Mux_t = Farchi / Fpixel.
- The temporal multiplexing ratio Mux_t makes it possible to define an equal number of time intervals, the read-access and write-access authorizations possibly being defined for each time interval. For example, for a rate Fpixel equal to 50 MHz and a rate Farchi of 200 MHz, the four computation tiles TC of
FIG. 1 may be chained in an arbitrary order if the interconnection means 4 comprise a minimum of two (4×(50/200)+1) data buses, the computation tiles TC being addressed by a temporal multiplexing of ratio four (=200/50). - According to this embodiment, each input/output unit UES can manage the read-access and write-access authorizations as a function of the number Nb_bus of buses and of the temporal multiplexing ratio Mux_t. In particular, each input/output unit UES can comprise registers making it possible to determine the time intervals during which the computation tile TC considered has a read-access or write-access authorization on one of the data buses and, for each of these time intervals, the data bus for which the read- or write-access is authorized. An input/output unit UES comprises for example, for the management of the write-access authorizations, Nb_bus registers of size log2(Mux_t) bits, where log2(x) is the function returning the logarithm to
base 2 of the number x and, for the management of the read-access authorizations, a register of size log2(Nb_bus) bits specifying the index of the bus to be read and a register of size log2(Mux_t) bits specifying the time interval. An exemplary embodiment of such an input/output unit UES is represented inFIG. 9 . The input/output unit UES comprises tworegisters register 91 managing the write-access authorization on thebus 41 and theregister 92 managing the write-access authorization on thebus 42. The content of theregisters comparators bus register 95 of 1 bit specifying the index of thebus register 96 of 2 bits specifying the time interval for the reading. The content of theregister 96 is also compared with the current time interval, for example by acomparator 97 and, in the case of equality, the reading of the data is authorized on thebus buses bus - According to a particular embodiment, represented in
FIG. 1 , each computation tile TC furthermore comprises a serial block BS comprising as many data registers as processing units UT present in the tile considered, the size of the registers being of size at least equal to the size of the data of the stream. The serial block BS of a computation tile TC receives as input the data originating from the processing unit or units UT and is connected at output to the input/output unit UES. During a write-authorization on one of thebuses bus -
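The bus count, the multiplexing ratio and the register sizes of an input/output unit UES defined earlier can be checked numerically. The sketch below takes the integer ratio p = Farchi/Fpixel as input; rounding K/p up when it is not an integer is an assumption of this sketch, as are the function names.

```python
def bus_count(k_tiles, p):
    """Nb_bus = K x (Fpixel / Farchi) + 1, with p = Farchi / Fpixel.

    Assumption: a fractional K/p is rounded up to the next integer.
    """
    return (k_tiles + p - 1) // p + 1


def log2_bits(x):
    """Bits needed to index x distinct values, i.e. log2(x) rounded up."""
    return (x - 1).bit_length()


def ues_register_bits(nb_bus, mux_t):
    """Register widths of one UES for write-access and read-access management."""
    slot_bits = log2_bits(mux_t)
    return {
        "write": [slot_bits] * nb_bus,   # one time-interval register per bus
        "read_bus": log2_bits(nb_bus),   # index of the bus to be read
        "read_slot": slot_bits,          # time interval of the reading
    }
```

For four tiles with Fpixel at 50 MHz and Farchi at 200 MHz (p = 4), this gives two buses, two write registers of 2 bits, and read registers of 1 bit and 2 bits.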
FIG. 10 illustrates an exemplary implementation of the datastream processing device 1 for processing operations to be carried out on raw images. The raw images arise for example from aBayer filter 110, for example represented inFIG. 11 . With such a filter, a color image consists of a mosaic of red, green and blue colored pixels. In particular, the mosaic consists of an alternation of blue and green pixels on a first type of row and of an alternation of green and red pixels on a second type of row, the types of rows also being alternated so as to form diagonals of green pixels. Thedevice 1 according to the invention is particularly adapted to such data. Indeed, for each type of row, it is possible to construct a computation tile TC capable of simultaneously processing several pixels although they are of different color. In one embodiment, represented inFIG. 10 , the computation tile TC comprises, on the one hand, a first control unit UC1 providing a first code to a first and to a third processing unit UT1 and UT3 and, on the other hand, a second control unit UC2 providing a second code to a second and to a fourth processing unit UT2 and UT4. The first code is specific to a first color of pixel, for example red, and the second code is specific to a second color of pixel, for example green. The code can also be split up into code segments, an address manager then indicating the color of the processed pixel to the control units UC1 and UC2. The processing units UT1, UT2, UT3 and UT4 then act on the data present in theblock 34 of neighborhood registers as a function of the instructions that they receive. In this instance, the first and third processing units UT1 and UT3 act on red pixels and the second and fourth processing units UT2 and UT4 act on the green pixels. The computation tile thus makes it possible to process simultaneously, but distinctly, four pixels of theblock 34 of neighborhood registers. 
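As an illustration of the tile of FIG. 10, the sketch below maps four consecutive pixels of a row to the processing units UT1 to UT4 and to the control unit serving each of them. Which color sits at position (0, 0) of the mosaic is an assumption of this sketch, and the function names are hypothetical.

```python
def bayer_color(row, col):
    """Color of a pixel in a Bayer mosaic (assumed to start with blue at (0, 0))."""
    if row % 2 == 0:
        return "B" if col % 2 == 0 else "G"  # blue/green row
    return "G" if col % 2 == 0 else "R"      # green/red row


def tile_assignment(row, first_col):
    """Control unit and pixel color for UT1..UT4 on four consecutive pixels."""
    return {
        f"UT{i + 1}": ("UC1" if i % 2 == 0 else "UC2", bayer_color(row, first_col + i))
        for i in range(4)
    }
```

Because the colors alternate along a row, the units served by the same control unit always see the same color, which is why two control units per row are enough for this filter.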
In the case of theBayer filter 110, two control units UC1 and UC2 per row suffice since a row comprises only two different colors. Quite obviously, the computation tiles may be adapted as a function of the color filter applied. - According to a particular embodiment, the data in transit on the interconnection means 4 can contain the datum to be processed, but also additional information, called metadata. These metadata may be used for the transport of various information associated with the data. Within the framework of video image processing, where the data relate to pixels, a metadatum contains for example a value representative of a noise correction or gain correction to be applied to the pixels. The same correction can thus be applied to all the pixels of the image. The metadata can also relate to the three values R, G and B (Red, Green and Blue), intermediate results to be associated with a pixel, or else information making it possible to control the program as a function of the characteristics of the pixel. The use of these metadata makes it possible to easily split the algorithms into several parts and to execute them on various computation tiles TC in multi-SIMD mode. The data in transit may be split according to various formats. The splitting format is specified in format registers Mri, as represented in
FIG. 12 (four in number in thisparticular case M r 0,M r 1,M r 2, Mr 3). In the case presented inFIG. 12 , a 24-bit data word has been divided into threeparts FIG. 13 , composed ofmultiplexers shift registers logical gates FIG. 13 are the format registers Mri 134 associated with position registers Dri 135. These registers Dri 135 are deduced from the registers Mri 134 by the software for loading the parameters of the program. They make it possible to obtain the start position of the metadatum considered. Only the format registers Mri 134 are given by the programmer, the position registers Dri 135 being obtained automatically on the basis of the format registers Mri 134. In the example ofFIG. 13 ,M r 0 is associated withD r 0=0,M r 1 is associated withD r 1=8,M r 2 is associated withD r 2=16 andM r 3 is associated withD r 1=24. A possible implementation of a mechanism for reading and writing the metadata in a 24-bit register 136, allowing access to the communication network, is given inFIG. 13 . - For the metadata reading part, the format registers 134 are linked by a
multiplexer 131b controlled by the current position to be recovered CMr to a logical AND cell 133d making it possible to set to zero the bits of the register 136 for which the position defined is not relevant, and then the position registers 135 are linked to a shift register 132b which makes it possible to shift the previous result the appropriate number of times so as to have a right-shifted value which is the final value to be recovered for the position considered.
- For the metadata writing part, the datum to be written Val is shifted to the appropriate position by virtue of the position register 135 linked to a multiplexer 131c controlled by the current position to be written CMw, which multiplexer 131c is itself linked to a shift register 132a. Only the bits concerned are not set to zero by virtue of a logical AND cell 133a linked to the multiplexer 131a which gives the bits to be masked by virtue of the format register 134. Finally these bits are concatenated with those already present in the destination register 136. For this purpose, the format register 134 is inverted by a logical NOT cell 137 and then a logical AND is carried out by a cell 133b between the inverted format register and the value of the register 136, thus making it possible to fabricate a new mask which drives a logical OR cell 133c making it possible to associate the new datum with the register 136 without touching the bits that are not relevant.
- With this management of the metadata, the computation of complex or fixed-point numbers is greatly eased. The processing device according to the invention can thus be advantageously used for the processing operations in the Fourier domain (frequency domain) for example. The processing device according to the invention can also be adapted for accelerating the emulation of floating-point numbers with the aid of fixed-point operators.
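The read and write paths described above amount to a mask-and-shift on a 24-bit word. The following sketch models them in software, assuming that a format register holds a contiguous bit mask and that the position register is the index of the lowest set bit of that mask; the function names are hypothetical.

```python
WORD_MASK = 0xFFFFFF  # 24-bit register giving access to the communication network


def position_of(format_mask):
    """Position register deduced from a format register: index of its lowest set bit."""
    if format_mask == 0:
        raise ValueError("format mask must select at least one bit")
    return (format_mask & -format_mask).bit_length() - 1


def read_field(word, format_mask):
    """Read path: AND with the format mask, then right shift by the position."""
    return (word & format_mask) >> position_of(format_mask)


def write_field(word, format_mask, value):
    """Write path: shift, mask, then merge with the bits kept from the old word."""
    shifted = (value << position_of(format_mask)) & format_mask
    kept = word & ~format_mask  # NOT of the format register, then AND
    return (kept | shifted) & WORD_MASK
```

With the format `0x00FF00`, `read_field(0x12AB34, 0x00FF00)` returns `0xAB`, and writing `0x7F` into the upper field of `0x123456` leaves the two lower fields untouched.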
- In a multi-SIMD architecture, the metadata may be used to determine the instructions to be transmitted to the processors. Indeed, the complementary information (metadata) can indicate the specific processing to be executed on the data with which they are associated. It suffices to extract the necessary information from the data word as soon as it enters the computation tile TC, illustrated by the example of
FIG. 14 , and to transmit it to the control unit UC which manages the processing units UT in multi-SIMD mode. In this instance, the metadatum may be extracted when the neighborhood manager organizes the data so as to transmit them to the processing units UT viamultiplexers 141. Anadditional communication 142 between the input/output unit UES and the control unit UC allows the transfer of the metadatum, as represented inFIG. 15 . - In order to allow the transfer of the metadata, the interconnection means 4 may be adapted in terms of capacity. In particular, the size of the
data buses device 1 according to the invention can comprise an insertion operator making it possible to concatenate each datum of the stream with a metadatum.FIG. 15 represents such aninsertion operator 150. Theinsertion operator 150 comprises aninput bus 151 linked to an input of aninsertion block 152 whose output is linked to an output bus 153. Theinsertion operator 150 can also comprise amemory 154 making it possible to store the metadatum. Thememory 154 is linked with theinsertion block 152 so as to allow the transfer of the metadatum. The size of this datum must be less than or equal to the difference between the maximum size of the data that can be transferred by the interconnection means 4 and the size of the data of the stream. The size of theinput bus 151 must be adapted to the size of the data of the stream whereas the size of the output bus 153 must be adapted to the size of the data of the stream that are concatenated with the metadatum. Theinsertion operator 150 may be inserted on one of thedata buses video sensor 2 and the computation tiles TC or between two computation tiles TC. In one embodiment, theinsertion operator 150 is embodied by a computation tile TC. The computation tile TC then comprises a storage unit UM containing the complementary datum and a processing unit UT making it possible to concatenate the data of the stream with the complementary datum. The complementary datum is for example stored in a data register of the storage unit UM. This embodiment presents the advantage of avoiding the insertion of an additional component into the chain for processing the data stream. It is made possible by virtue of the modularity of thedevice 1 according to the invention. - Instruction set (on 48 bits: 24 bits per pathway)
- Composition of the 24-bit instruction word:

    0 ... 2    -> 3 Condition bits
    3 ... 4    -> 2 Data access mode bits
    5 ... 7    -> 3 bits identifying the operation
    8 ... 11   -> 4 Destination Register bits
    12 ... 15  -> 4 Source Register bits
    16 ... 23  -> 8 constant bits

- Prefix of the instructions:

    F_:  Execution if flag=1
    NF_: Execution if flag=0
    fC:  Update of the flag on Carry
    fZ:  Update of the flag on Result at ZERO
    fS:  Update of the flag on Sign (1 if >0, 0 if <0)

- Postfix of the instructions:
- Allows choice of the source

    r (D, A, B): R[D] destination register; R[A] source register, R[B] source register
    c (D, A, C): R[D] destination register; R[A] source register, C Constant
    v (D, A, V): R[D] destination register; R[A] source register, Neighbor[V]
                 If the 8 bits of the argument B are formed as "10...0<V>" instead of "0...<V>",
                 the value stored by the register V will be taken as neighbor, i.e. Neighbor[R[V]]
    m (D, A, M): R[D] destination register; R[A] source register, M Memory address of the local memory

- Use:
- The VLIW allows work on two pathways:

    OPi(Dm...) / OPj(Dn)

- However, we must have j!=i and m!=n
- Except in the case of the complementary conditional instructions:

    F_OP;  No operation if FLAG = 1
    NF_OP; No operation if FLAG = 0

- We can therefore write on the same row

    F_OP(Dm...) / NF_OP(Dn...)

- Since whatever the value of the Flag, only one will actually be executed
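The 24-bit word layout given above can be sketched as a pack and unpack pair. Treating bit 0 as the least significant bit is an assumption of this sketch, and the numeric field values used in the example are arbitrary, since the annex does not list the binary codes of the conditions, access modes or operations.

```python
# (name, position of lowest bit, width in bits), following the annex layout
FIELDS = (
    ("cond", 0, 3),    # condition
    ("mode", 3, 2),    # data access mode
    ("op", 5, 3),      # operation identifier
    ("dst", 8, 4),     # destination register
    ("src", 12, 4),    # source register
    ("const", 16, 8),  # constant
)


def encode(**values):
    """Pack named field values into one 24-bit instruction word."""
    word = 0
    for name, lsb, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << lsb
    return word


def decode(word):
    """Unpack a 24-bit instruction word into its named fields."""
    return {name: (word >> lsb) & ((1 << width) - 1) for name, lsb, width in FIELDS}
```

A 48-bit VLIW instruction would then hold two such words, one per pathway.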
- List of operations

- NOP
    PREFIX: F_ NF_
    NOP
    F_NOP
    NF_NOP

- LD
    PREFIX: F_ NF_ fZ_ fC_ fS_
    LDr(D, A):  R[D] = R[A]
    LDc(D, C):  R[D] = C; C signed constant
    LDv(D, V):  R[D] = Neig[V]
    LDv(D, −V): R[D] = Neig[R[V]]
    LDm(D, M):  R[D] = SP[M]

- ADD; SUB; MUL
    PREFIX: F_ NF_ fZ_ fC_ fS_
- Two signed adders are available, we therefore have ADD0 and ADD1 usable simultaneously without restriction on the pathway.
    ADD0r(D, A, B):  R[D] = R[A] + R[B]
    ADD0c(D, A, C):  R[D] = R[A] + C; C signed constant
    ADD0v(D, A, V):  R[D] = R[A] + Neig[V]
    ADD0v(D, A, −V): R[D] = R[A] + Neig[R[V]]
    ADD0m(D, A, M):  R[D] = R[A] + SP[M]
- DITTO for ADD1, SUB0, SUB1 and MUL; all these operations are signed

- SHIFT
    PREFIX: F_ NF_ fZ_ fC_ fS_
- A signed shift operator makes it possible to shift the values to the right or left according to the sign of the shift
- A shift of 0 is equivalent to an assignment
    SHIFTc(D,A,C): if (C > 0) R[D] = R[A] << C
                   if (C < 0) R[D] = R[A] >> −C

- INV
    INVr(D,A):  R[D] = −R[A]
    INVc(D,C):  R[D] = −C
    INVv(D,V):  R[D] = −Neig[V]
    INVv(D,−V): R[D] = −Neig[R[V]]
    INVm(D,M):  R[D] = −SP[M]

- Program code example
- This code carries out the following operation: for a 2×2 neighborhood, it sets R1 to the average of the pixels of the neighborhood and sets R0 to 255 if the value of the average is >128; it increments R2 if the pixel is at 255 (so as to have the count of the pixels >128 at the end of the processing)

    #include "macros.h"
    .initcode
    LDc(R0,0); / NOP
    LDc(R1,0); / NOP
    LDc(R2,0); / NOP
    LDc(R3,0); / NOP
    LDc(R4,0); / NOP
    LDc(R5,0); / NOP
    NOP / NOP
    .pixelcode0
    LDv(R1,V0) / NOP
    LDv(R2,V1) / ADD0v(R1,R1,V0)
    ADD0v(R2,R2,V2) / ADD1v(R1,R1,V3)
    ADD0(R1,R1,R2) / NOP
    SHIFTc(R1,R1,−2) / NOP
    fS_SUBc(R7,R1,128) / NOP
    F_LDc(R0,0) / NF_LDc(R0,255); in this exceptional case it is possible to call LD 2x since one is executed and not the other
    NF_ADD0v(R2,R2,1) / NOP
    NOP / NOP
    .pixelcode1
    .pixelcode2
    .pixelcode3

- The pixel code could also be written so as to make maximum use of the 2 pathways of the VLIW:

    .pixelcode0
    LDv(R1,V0) / NF_ADD0v(R2,R2,1)
    LDv(R2,V1) / ADD0v(R1,R1,V0)
    ADD0v(R2,R2,V2) / ADD1v(R1,R1,V3)
    ADD0(R1,R1,R2) / NOP
    SHIFTc(R1,R1,−2) / NOP
    fS_SUBc(R7,R1,128) / NOP
    F_LDc(R0,0) / NF_LDc(R0,255)
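For comparison, the operation that the program code example is stated to carry out can be written as a short reference model. This sketch follows the prose description of the example (average of a 2×2 neighborhood, threshold at 128, count of the pixels set to 255); it is not an instruction-by-instruction emulation of the listing, and the function name and data layout are assumptions.

```python
def threshold_neighborhoods(neighborhoods, threshold=128):
    """Reference model of the stated behavior of the example program.

    neighborhoods: iterable of 4-tuples, the pixels of each 2x2 neighborhood.
    Returns a list of (average, output) pairs and the count of outputs at 255.
    """
    results, count = [], 0
    for pixels in neighborhoods:
        average = sum(pixels) >> 2  # right shift by 2, as SHIFTc with -2
        output = 255 if average > threshold else 0
        if output == 255:
            count += 1
        results.append((average, output))
    return results, count
```

The right shift truncates, so a neighborhood summing to 513 yields an average of 128 and is not counted.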
Claims (16)
Nb_bus = K × (Fpixel / Farchi) + 1,
p × Fpixel
Mux_t = Farchi / Fpixel.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FR0805369A FR2936626B1 (en) | 2008-09-30 | 2008-09-30 | DEVICE FOR PARALLEL PROCESSING OF A DATA STREAM |
FR0805369 | 2008-09-30 | ||
PCT/EP2009/057033 WO2010037570A1 (en) | 2008-09-30 | 2009-06-08 | Device for the parallel processing of a data stream |
Publications (2)
Publication Number | Publication Date |
---|---|
US20110273459A1 true US20110273459A1 (en) | 2011-11-10 |
US8836708B2 US8836708B2 (en) | 2014-09-16 |
Family
ID=40350007
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/121,417 Expired - Fee Related US8836708B2 (en) | 2008-09-30 | 2009-06-08 | Device for the parallel processing of a data stream |
Country Status (5)
Country | Link |
---|---|
US (1) | US8836708B2 (en) |
EP (1) | EP2332067A1 (en) |
JP (1) | JP2012504264A (en) |
FR (1) | FR2936626B1 (en) |
WO (1) | WO2010037570A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10942673B2 (en) | 2016-03-31 | 2021-03-09 | Hewlett Packard Enterprise Development Lp | Data processing using resistive memory arrays |
WO2019212466A1 (en) | 2018-04-30 | 2019-11-07 | Hewlett Packard Enterprise Development Lp | Resistive and digital processing cores |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070242074A1 (en) * | 1999-04-09 | 2007-10-18 | Dave Stuttard | Parallel data processing apparatus |
US20090187736A1 (en) * | 2005-10-26 | 2009-07-23 | Cortica Ltd. | Computing Device, a System and a Method for Parallel Processing of Data Streams |
US7711938B2 (en) * | 1992-06-30 | 2010-05-04 | Adrian P Wise | Multistandard video decoder and decompression system for processing encoded bit streams including start code detection and methods relating thereto |
US7788468B1 (en) * | 2005-12-15 | 2010-08-31 | Nvidia Corporation | Synchronization of threads in a cooperative thread array |
US7912889B1 (en) * | 2006-06-16 | 2011-03-22 | Nvidia Corporation | Mapping the threads of a CTA to the elements of a tile for efficient matrix multiplication |
US8250555B1 (en) * | 2007-02-07 | 2012-08-21 | Tilera Corporation | Compiling code for parallel processing architectures based on control flow |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6041400A (en) * | 1998-10-26 | 2000-03-21 | Sony Corporation | Distributed extensible processing architecture for digital signal processing applications |
US6961084B1 (en) * | 1999-10-07 | 2005-11-01 | Ess Technology, Inc. | Programmable image transform processor |
US7196708B2 (en) * | 2004-03-31 | 2007-03-27 | Sony Corporation | Parallel vector processing |
2008
- 2008-09-30: FR application FR0805369A, published as FR2936626B1, status: Active
2009
- 2009-06-08: WO application PCT/EP2009/057033, published as WO2010037570A1, status: Application Filing
- 2009-06-08: US application US13/121,417, published as US8836708B2, status: Expired - Fee Related (not active)
- 2009-06-08: JP application JP2011528264A, published as JP2012504264A, status: Pending
- 2009-06-08: EP application EP09779672A, published as EP2332067A1, status: Withdrawn (not active)
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9250963B2 (en) | 2011-11-24 | 2016-02-02 | Alibaba Group Holding Limited | Distributed data stream processing method and system |
US9727613B2 (en) | 2011-11-24 | 2017-08-08 | Alibaba Group Holding Limited | Distributed data stream processing method and system |
US9191241B2 (en) | 2012-03-13 | 2015-11-17 | Commissariat A L' Energie Atomique Et Aux Energies Alternatives | Method for acquiring and processing signals |
US9514094B2 (en) * | 2012-07-10 | 2016-12-06 | Maxeler Technologies Ltd | Processing data sets using dedicated logic units to prevent data collision in a pipelined stream processor |
US20140019729A1 (en) * | 2012-07-10 | 2014-01-16 | Maxeler Technologies, Ltd. | Method for Processing Data Sets, a Pipelined Stream Processor for Processing Data Sets, and a Computer Program for Programming a Pipelined Stream Processor |
US10621954B2 (en) | 2015-10-19 | 2020-04-14 | Oath Inc. | Computerized system and method for automatically creating and applying a filter to alter the display of rendered media |
US20170110093A1 (en) * | 2015-10-19 | 2017-04-20 | Yahoo! Inc. | Computerized system and method for automatically creating and applying a filter to alter the display of rendered media |
US9905200B2 (en) * | 2015-10-19 | 2018-02-27 | Yahoo Holdings, Inc. | Computerized system and method for automatically creating and applying a filter to alter the display of rendered media |
US11645226B1 (en) | 2017-09-15 | 2023-05-09 | Groq, Inc. | Compiler operations for tensor streaming processor |
US11822510B1 (en) * | 2017-09-15 | 2023-11-21 | Groq, Inc. | Instruction format and instruction set architecture for tensor streaming processor |
US11868250B1 (en) | 2017-09-15 | 2024-01-09 | Groq, Inc. | Memory design for a processor |
US11875874B2 (en) | 2017-09-15 | 2024-01-16 | Groq, Inc. | Data structures with multiple read ports |
US11868908B2 (en) | 2017-09-21 | 2024-01-09 | Groq, Inc. | Processor compiler for scheduling instructions to reduce execution delay due to dependencies |
US20190294443A1 (en) * | 2018-03-20 | 2019-09-26 | Qualcomm Incorporated | Providing early pipeline optimization of conditional instructions in processor-based systems |
US10652162B2 (en) * | 2018-06-30 | 2020-05-12 | Intel Corporation | Scalable packet processing |
US20190044876A1 (en) * | 2018-06-30 | 2019-02-07 | Intel Corporation | Scalable packet processing |
US11809514B2 (en) | 2018-11-19 | 2023-11-07 | Groq, Inc. | Expanded kernel generation |
CN109542985A (en) * | 2018-11-27 | 2019-03-29 | 江苏擎天信息科技有限公司 | A kind of general streaming Data Analysis Model and its construction method |
US11868804B1 (en) | 2019-11-18 | 2024-01-09 | Groq, Inc. | Processor instruction dispatch configuration |
Also Published As
Publication number | Publication date |
---|---|
WO2010037570A1 (en) | 2010-04-08 |
EP2332067A1 (en) | 2011-06-15 |
FR2936626B1 (en) | 2011-03-25 |
JP2012504264A (en) | 2012-02-16 |
FR2936626A1 (en) | 2010-04-02 |
US8836708B2 (en) | 2014-09-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: COMMISSARIAT A L'ENERGIE ATOMIQUE ET AUX ENERGIES; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LETELLIER, LAURENT;THEVENIN, MATHIEU;SIGNING DATES FROM 20110305 TO 20110330;REEL/FRAME:026337/0101 |
| FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.) |
| LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
| FP | Expired due to failure to pay maintenance fee | Effective date: 20180916 |