[go: up one dir, main page]
More Web Proxy on the site http://driver.im/

CN108664322A - Data processing method and system - Google Patents

Data processing method and system Download PDF

Info

Publication number
CN108664322A
CN108664322A CN201710198858.7A CN201710198858A CN108664322A CN 108664322 A CN108664322 A CN 108664322A CN 201710198858 A CN201710198858 A CN 201710198858A CN 108664322 A CN108664322 A CN 108664322A
Authority
CN
China
Prior art keywords
data
calculating
processing
unit
polymerization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710198858.7A
Other languages
Chinese (zh)
Inventor
周海建
杨凌云
刘振鲁
陈力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangdong Shenma Search Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Shenma Search Technology Co Ltd filed Critical Guangdong Shenma Search Technology Co Ltd
Priority to CN201710198858.7A priority Critical patent/CN108664322A/en
Publication of CN108664322A publication Critical patent/CN108664322A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A kind of data processing method of present invention offer and system, the data processing system include the data storage cell, processing unit and calculating service unit of relatively independent operation;The method includes:The processing unit reads the pending data in the data storage cell one by one;According to preset calculating task, judge whether the pending data needs to carry out polymerization calculating;When polymerization calculating need not be carried out, it is sent to the calculating service unit by the processing unit or by the processing data and is commonly calculated;When carrying out polymerization calculating, which is sent to calculating service unit and carries out polymerization calculating.By carrying out relatively independent storage, processing and calculating to data, and using the method converted batch processed for Stream Processing, the scalability of the data processing system is enhanced, and improves the data processing system when carrying out data processing to the utilization rate of computing resource.

Description

Data processing method and system
Technical field
The present invention relates to technical field of data processing, in particular to a kind of data processing method and system.
Background technology
With the development of digital technology, the side for generally requiring to be calculated using batch when data processing is carried out to high-volume data Method handles data.The batch computing system of the prior art needs the finely human configuration with complexity, the work(of task process Energy and performance complexity are all bigger.When storage and calculation scale vary widely, need to various resource parameters Complicated adjustment is done, the Performance tuning in complex scene is very difficult, results in the scalability of entire data processing system Difference.
Invention content
In order to overcome above-mentioned deficiency in the prior art, the purpose of the present invention is to provide a kind of data processing methods, answer For data processing system, the data processing system includes the data storage cell, processing unit and meter of relatively independent operation Service unit is calculated, the sequence of data of the data storage cell supporting is read and random writing;The method includes:
The processing unit reads the pending data in the data storage cell one by one;
According to preset calculating task, judge whether the pending data needs to carry out polymerization calculating, wherein described poly- Total calculate includes the calculating for needing that a plurality of pending data is combined to carry out;
When judging that the pending data need not carry out polymerization and calculate, by the processing unit or by the processing number It is commonly calculated according to the calculating service unit is sent to, wherein described to be commonly calculated as being directed to every pending number According to the calculating one by one independently carried out;
When judging that the pending data carries out polymerization calculating, it is single which is sent to calculating service Member carries out polymerization calculating;
The data that processing is completed are preserved by way of random writing to the data storage cell.
Another object of the present invention is to provide a kind of data processing system, the data processing system includes relatively independent Data storage cell, processing unit and the calculating service unit of operation, the data storage cell support the sequence of data to read And random writing;Data friendship is carried out by data-interface between the data storage cell, processing unit and calculating service unit Mutually to realize following functions:
The processing unit reads the pending data in the data storage cell one by one;
According to preset calculating task, judge whether the pending data needs to carry out polymerization calculating, wherein described poly- Total calculate includes the calculating for needing that a plurality of pending data is combined to carry out;
When judging that the pending data need not carry out polymerization and calculate, by the processing unit or by the processing number It is commonly calculated according to the calculating service unit is sent to, wherein described to be commonly calculated as being directed to every pending number According to the calculating one by one independently carried out;
When judging that the pending data carries out polymerization calculating, it is single which is sent to calculating service Member carries out polymerization calculating;
The data that processing is completed are preserved by way of random writing to the data storage cell.
In terms of existing technologies, the invention has the advantages that:
Data processing method and system provided by the invention, by carrying out relatively independent storage, processing and meter to data It calculates, and using the method converted batch processed for Stream Processing, enhances the scalability of the data processing system, and The data processing system is improved when carrying out data processing to the utilization rate of computing resource.
Description of the drawings
In order to illustrate the technical solution of the embodiments of the present invention more clearly, below will be to needed in the embodiment attached Figure is briefly described, it should be understood that the following drawings illustrates only certain embodiments of the present invention, therefore is not construed as pair The restriction of range for those of ordinary skill in the art without creative efforts, can also be according to this A little attached drawings obtain other relevant attached drawings.
Fig. 1 is the schematic diagram of data processing system provided in an embodiment of the present invention;
Fig. 2 is the flow diagram of data processing method provided in an embodiment of the present invention;
Fig. 3 is the flow diagram of step S110 sub-steps shown in Fig. 2;
Fig. 4 is the schematic diagram of processing unit shown in Fig. 1;
Fig. 5 is the schematic diagram that service unit is calculated shown in Fig. 1;
Fig. 6 is prior art computing resource utilization power schematic diagram;
Fig. 7 is the computing resource utilization power schematic diagram of data processing method provided in an embodiment of the present invention.
Icon:10- data processing systems;100- data storage cells;200- processing units;210- digital independent are single Member;220- data processing subelements;300- scheduling units;400- calculates service unit;410- heavy type computation subunits;420- is logical Use computation subunit;430- polymerize computation subunit.
Specific implementation mode
In order to make the object, technical scheme and advantages of the embodiment of the invention clearer, below in conjunction with the embodiment of the present invention In attached drawing, technical scheme in the embodiment of the invention is clearly and completely described, it is clear that described embodiment is A part of the embodiment of the present invention, instead of all the embodiments.The present invention being usually described and illustrated herein in the accompanying drawings is implemented The component of example can be arranged and be designed with a variety of different configurations.
Therefore, below the detailed description of the embodiment of the present invention to providing in the accompanying drawings be not intended to limit it is claimed The scope of the present invention, but be merely representative of the present invention selected embodiment.Based on the embodiments of the present invention, this field is common The every other embodiment that technical staff is obtained without creative efforts belongs to the model that the present invention protects It encloses.
It should be noted that:Similar label and letter indicate similar terms in following attached drawing, therefore, once a certain Xiang Yi It is defined, then it further need not be defined and explained in subsequent attached drawing in a attached drawing.
Fig. 1 is please referred to, Fig. 1 is the schematic diagram for being the data processing system 10 that present pre-ferred embodiments provide, the number Include data storage cell 100, processing unit 200 and calculating service unit 400 according to processing system 10.
In an embodiment of the present embodiment, the data processing system 10 can be distributed processing system(DPS), In, the data storage cell 100, processing unit 200 or calculating service unit 400 can be in the distributed processing system(DPS) One or more independently operated data processing terminals or virtual machine.
In the another embodiment of the present embodiment, the data processing system 10 may be to operate in a data Processing system on processing terminal, wherein the data storage cell 100, processing unit 200 and calculating service unit 400 are Run on the stand-alone program or process of the data processing system 10.
In the present embodiment, the data storage cell 100 for storing data, support by the data storage cell 100 The sequence of data is read and random writing, and meets the needs of solving mapping-reduction (Map-Reduce).By orderly storing, It may be implemented to compare (compare) function.By mapping mechanism, subregion/break up (partition/shuffle) work(may be implemented Energy.Merge plug-in unit by embedded data, may be implemented to merge (combine) function.In this way, ensure that the data storage is single Pending data in member 100 can carry out streaming and read and write one by one.
Specifically, it in the data storage cell 100 can be the storage system that sequence reading and random writing are provided, institute Stating data storage cell 100 can be but be not limited only to relevant database, NoSql databases, data warehouse or orderly key assignments To storage system.In the present embodiment, it is carried out by taking the data storage cell 100 preferably orderly key-value pair storage system as an example Illustrate, the data storage cell 100 can be Apache Hbase or Google Bigtable systems etc..
The processing unit 200 for reading pending data one by one from the data storage cell 100, to by certainly The data processing process of body calls the calculating service for calculating service unit 400 to carry out at calculating the pending data Reason.
Please refer to Fig. 2, Fig. 2 is a kind of data processing method applied to data processing system 10 shown in FIG. 1, below it is right Each step is described in detail in the method.
Step S110, the processing unit 200 read the pending data in the data storage cell 100 one by one.
Specifically, referring once again to Fig. 1, in the present embodiment, the data processing system 10 can also include that scheduling is single Member 300, the processing unit 200 is obtained pending in the data storage cell 100 one by one by the scheduling unit 300 Data.
In the present embodiment, Fig. 3 is please referred to, the scheduling unit 300 can pass through step S210, step S220 and step S230 carries out the distribution of the pending data.
Pending data in the data storage cell 100 is divided into multiple by step S210, the scheduling unit 300 Data fragmentation, wherein each data fragmentation includes at least one pending data.
In the present embodiment, the scheduling unit 300 is according to preset data processing total duration and preset data fragmentation Pending data in the data storage cell 100 is divided into multiple data fragmentations, wherein the data by handling duration Processing total duration is the total duration needed for all pending datas of processing, and the data fragmentation handling duration is that processing is each described Duration needed for data fragmentation.
Specifically, when the scheduling unit 300 is handled according to preset data processing total duration and preset data fragmentation Long calculate obtains data fragmentation quantity.
The data fragmentation quantity depends on each data fragmentation handling duration, and the data fragmentation handling duration can To be set as (such as 5 minutes) of minute grade.In this way, in data fragmentation handling failure, even if treatment progress fails or is killed It is also minute grade to restore cost afterwards.Compared with the existing technology need half an hour or more long recovery time, the present embodiment to carry The scheme gone out obviously has higher safety and stability.
In said circumstances, calculate the data processing total duration and the data fragmentation handling duration quotient obtain described in The quantity of data fragmentation.For example, when needing all pending datas of interior processing completion when 24 is small, when the data processing is total Length could be provided as 24 hours, and have the data fragmentation handling duration to be set as 5 minutes, then can obtain the number of the data fragmentation Amount is 288.
Further, in order to make computing resource using more smoothly to improve resource utilization, it is also desirable to It is expected that just having handled all fragments in processing total duration.But actual conditions are more complicated, the reality of each data fragmentation Processing time may differ, and the data fragmentation handling duration as expected is 5 minutes, but is possible at 3~7 points under actual conditions Clock is completed, and the overall time of every wheel processing is caused not fully to be fixed.
Therefore, in the present embodiment, when carrying out the data processing of more rounds, one section of reserved stand-by period can be set, If it is expected that processing total duration is 24 hours, that is, it is expected to carry out the data processing of a round in 24 hours, then by the number It is set as 23 hours according to processing total duration, if in the case of fulfiling round processing ahead of schedule, waits for start next round again within 1 hour If processing immediately begins to next round processing in the normal or overtime processing for completing this round.
After determining the data fragmentation quantity, the scheduling unit 300 waits locating according to the data fragmentation quantity by described Reason data are divided into multiple data fragmentations, wherein each data fragmentation corresponds to a range of key values.
Specifically, in the present embodiment, the data storage cell 100 includes being stored by way of key-value pair A plurality of data, per data correspond to a key assignments.The scheduling unit 300 is according to the data fragmentation quantity and needs to be located The range of key values for managing data, multiple data fragmentations are divided by pending data.
For example, in said circumstances, if the key assignments of the pending data is calculated according to 128 MD5 hashed values , then can take preceding 20 bit range of key values 0~1048576 as a whole, according to 0~1,1~2 ..., 1048575~ 1048576 range of key values divides to obtain 1048576 data fragmentations.Or can according to 0~16,16~32,1048560~ 1048576 range of key values divides to obtain 65536 data fragmentations.
It is worth noting that the setting method of the key assignments, is not limited to above-mentioned 128 MD5 hashed values.In the present embodiment In other embodiment, other key assignments set-up modes can also be taken to ensure that the number of data in each range of key values is substantially Close, the handling duration to ensure each data fragmentation is substantially close.For example, in the storage using plaintext string as key assignments In system, can alphabetically divide range of key values, such as by the first two letter of key assignments can according to aa~ac, ac~ Ae ..., the range of key values of zx~zz divides to obtain 338 data fragmentations etc..
Step S220, for each data fragmentation, the scheduling unit 300 is that the data fragmentation distributes described in one The treatment progress of processing unit 200 is used to handle the pending data in the data fragmentation.
In an embodiment of the present embodiment, the scheduling unit 300 obtain handled in the processing unit 200 into The working condition of journey, and pending data fragmentation is pushed to treatment progress according to the working condition.
Specifically, the scheduling unit 300 is according to the working condition of the treatment progress of the processing unit 200, from described At least one data fragmentation is chosen in data storage cell 100, and the corresponding range of key values of the data fragmentation is sent to the place Manage an idle treatment progress of unit 200, wherein the free time treatment progress can be the process for not carrying out data processing, It can certainly be the less process of data processing task.
In the another embodiment of the present embodiment, some treatment progress of the processing unit 200 is not into line number When according to processing, data fragmentation that can be in data storage cell 100 described in active pull carries out data processing.
Specifically, when the 200 available free treatment progress of processing unit, the free time treatment progress is to the scheduling unit 300 transmission data fragments obtain request.The scheduling unit 300 is sent according to the idle treatment progress of the processing unit 200 Data fragmentation obtain request, at least one data fragmentation is chosen from the data storage cell 100, by the data fragmentation pair The range of key values answered is sent to the free time treatment progress of the processing unit 200.
Further, the scheduling unit 300 is according to the number for carrying out data processing in the data storage cell 100 The selection of the data fragmentation is carried out according to the quantity of fragment.
Specifically, the data storage cell 100 would generally either statically or dynamically mark off multiple storages according to data scale Subregion, each partition holding operate on certain machine of distributed system, and safeguard a part of data.In order to avoid the storage Data congestion (such as read-write congestion and calculating hot spot etc.) where subregion on machine, the scheduling unit 300 should avoid as possible It distributes simultaneously and processing falls multiple data fragmentations on the same partition holding, to prevent these data fragmentations processing time super It crosses and is expected, and bring the risk of disposed of in its entirety time time-out.
Therefore in the present embodiment, the scheduling unit 300 obtains the number that data processing is being carried out in each partition holding According to the quantity of fragment, the partition holding for the data fragmentation minimum number for carrying out data processing is chosen as target storage point Area, and choose a data fragmentation from the target partition holding and be sent to the idle treatment progress.
It is worth noting that the choosing method of above-mentioned data fragmentation is only a kind of embodiment provided in this embodiment, In the other embodiment of the present embodiment, the selection of the data fragmentation can also be carried out according to other scheduling strategies, to reduce The risk of data congestion.
Step S230, the treatment progress of the processing unit 200 obtain pending in assigned data fragmentation one by one Data.
Further, Fig. 4 is please referred to, the processing unit 200 may include digital independent subelement 210 and data processing Subelement 220.The digital independent subelement 210 is used to read pending data one by one from the data storage cell 100, institute It includes treatment progress to state data processing subelement 220, for carrying out data processing to the pending data.
In the present embodiment, the digital independent subelement 210 is according to the range of key values of the data fragmentation, successively from institute State the pending data obtained one by one in data storage cell 100 in the range of key values.In the present embodiment, data actual treatment Sequence and read in memory sequence can be consistent, can not also be consistent, depend on calculating task setting.If need not keep It is completely the same, then the mechanism such as multithreading can be set and carry out concurrent processing, to improve speed.
Each data processing subelement 220 obtains the pending data given in range of key values, and enters step S120 and open Begin to carry out Stream Processing one by one to the pending data.
Step S120 judges whether the pending data needs to carry out polymerization calculating according to preset calculating task.
The data processing subelement 220 is directed to every pending data, according to calculating task to this pending data Judged.
The preset calculating task can be the data processing topological structure pre-set, be wrapped in the topological structure It includes pending data and needs the data processing node flowed through.The calculating task includes that the common calculating and polymerization calculate, In, the common calculating one by one for being calculated as independently carrying out for every pending data, the polymerization calculating includes needing The calculating to be carried out in conjunction with a plurality of pending data.
Therefore in the present embodiment, the data processing subelement 220 judges described pending according to preset calculating task Whether data need to carry out polymerization calculating, to execute different calculation processings.
When that need not carry out polymerization calculating to the pending data, go to step S130, when needs are waited for described When processing data carry out polymerization calculating, go to step S140.
Step S130, be sent to by the processing unit 200 or by the processing data calculating service unit 400 into Row is common to be calculated.
The common calculating includes that light-duty calculating, heavy calculating and general-purpose computations, the light-duty calculating include preset meter The smaller service logic calculation processing of calculation amount, heavy calculate includes the larger calculation processing of preset calculation amount, described logical Include the general calculation processing of preset different computing tasks with calculating.
It may include one of following manner or the combination between it to carry out the common calculation.
When carrying out light-duty calculation processing, light-duty calculating is carried out to the pending data in the processing unit 200 Processing.
The data processing subelement 220 of the processing unit 200 can be used for executing the light-duty calculation processing, described light Type calculation processing is the smaller simple computation of performance cost, for example, the conversion of certain field format in data, from one or more Field derives a new field etc..The light-duty calculation processing can also be one and occupy the less complicated calculation of computing resource Method.
Fig. 5 is please referred to, the calculating service unit 400 may include heavy computation subunit 410, general-purpose computations subelement 420 or polymerization computation subunit 430.
When needing to carry out heavy calculate to the pending data, the heavy type of the heavy computation subunit 410 is called The service of calculating carries out the pending data heavy calculation processing one by one.
For the heavy type computation subunit 410 for executing the heavy calculation processing, the heavy type calculation processing is to need Expend a large amount of complicated calculations calculated with Internet resources.For example, heavy calculation processing usually requires prodigious memory, for example need Prodigious configuration file is loaded, or needs to preserve prodigious ephemeral data etc. in memory.
When needing to carry out general-purpose computations to the pending data, the general of the general-purpose computations subelement 420 is called The service of calculating carries out general-purpose computations processing one by one to the pending data.
For the general-purpose computations subelement 420 for executing the general-purpose computations processing, the general-purpose computations processing is to have very The data handling procedure that multi-service flow can all be shared.Such as in search engine off-line data process flow, to webpage url's Mutual turn of standardization processing, duplicate removal processing, pc pages and move page etc., can all be shared by multiple operation flows.For another example, general-purpose computations Task can be the key-value pair retrieval service provided by general random key-value pair storage system (such as redis, tair).
The pending data is sent to calculating service unit 400 and carries out polymerization calculating by step S140.
When needing to carry out polymerization calculating to the pending data, which is sent to the polymerization and is calculated Subelement 430 carries out polymerization calculating.
It due to the data stored by key-value pair mode, is usually independent from each other between data, and is carrying out full dose data In the case of processing, it is sometimes desirable to which a plurality of data aggregate is calculated, i.e., has certain correlation between data.For example, institute It includes the number for needing to count the appearance of certain class data to state polymerization and calculate, or needs to be ranked up data by certain rule.
In an embodiment of the present embodiment, pending data is sent to the polymerization by the processing unit 200 Computation subunit 430.It is pending in processing time section to presetting that the polymerization computation subunit 430 calls polymerization to calculate service Data carry out partial polymerization calculating, or carry out global polymerization to all pending datas and calculate.
Further, the system 10 further includes polymerization result storage unit.The calculating service unit 400 counts polymerization Result during calculation is preserved to the polymerization result storage unit, so that when calculating service unit 400 is restarted, is gathered from described Close the result for restoring that the global polymerization calculates or partial polymerization calculates in result storage unit again.
Under certain business scenarios, the polymerization of above-mentioned pure streaming is calculated there are certain restrictions, i.e. partial polymerization can influence to gather Integrality is closed, and global polymerization is to the more demanding of computing resource especially data buffer storage.Therefore it provides in the present embodiment another Kind realizes the embodiment that polymerization calculates, and in this embodiment, pending data is sent to described by the processing unit 200 Calculate service unit 400.Then the calculating service unit 400 calls polymerization to calculate service advance to the data progress received It handles one by one, and records the results of intermediate calculations for handling acquisition one by one in advance, after pre-set delay duration, execute batch meter Calculation task carries out unified polymerization to the results of intermediate calculations and calculates.
Based on above-mentioned design, is handled one by one by streaming and complete most of time-consuming more calculating step, obtain data Then the results of intermediate calculations of scale is smaller is calculated by batch wise polymerization and integrates the results of intermediate calculations.For example, solid daily Fixed time point carries out batch wise polymerization processing to the results of intermediate calculations that streaming before is handled one by one.By flowing one by one above The method of formula processing cooperation batch wise polymerization processing, can obtain more complete polymerization result of calculation, and locate one by one by streaming The combination of reason and batch wise polymerization processing can make the use of the computing resource of the data processing system 10 more smooth, improve The utilization rate of computing resource.
It polymerize after calculation processing as a result, currently processed process can be returned to, is also sent to and needs the polymerization knot Other treatment progress of fruit.In this way, both realized the polymerizable functional of lot data, can also make full use of that streaming computing brings is The promotion for the scalability and resource utilization of uniting.
In practical applications, data processing system 10 may be performed simultaneously multiple calculating tasks, and each calculating task needs Individual data processing is carried out to pending data.The method provided through this embodiment, by being converted to batch processed Stream Processing becomes smoothly to use in 24 hours 1 day the processing to data from concentrating a period of time to monopolize cluster resource The resource quota (quota) that one relatively fixed in cluster resource.Therefore, under the scene that multiple calculating tasks are carried out at the same time, Each calculating task is assigned certain resource quota, and is run with same priority, then resource contention between calculating task Situation can be greatly reduced, to improve resource utilization.
Further, the processing unit 200 detects whether the data fragmentation handles completion in data processing, The data fragmentation is identified when data fragmentation processing is completed.
During carrying out data processing, if some treatment progress exits (such as fail or killed) extremely, it can lead Cause the remaining data of the data fragmentation for the treatment of progress processing that can not handle.In the present embodiment, can be arranged it is described handle into Journey to it is processed at data fragmentation be identified, e.g., ensure data quilt by mechanism such as checkpoints (checkpoint) All processing.
Specifically, the treatment progress can record a checkpoint, table after the data for having handled a data fragmentation It is bright to have handled the data fragmentation.In order to ensure checkpoint will not be lost, it can be recorded in distributed memory system. After treatment progress exits and restarts, the checkpoint can be checked, untreated complete data fragmentation, which can be reallocated, goes forward side by side Row processing.Since each data fragmentation processing time is minute grade, the cost very little that treatment progress restores.
Particularly, above-mentioned recovery scheme can ensure that the data of at-least-once (i.e. same data are at least written once) Consistency.If operation flow needs to realize at-most-once (the at most write-in of i.e. same data is primary) or exactly-once The strong consistency of (i.e. same data must just be written once), reprocessing and write-back can cause data different with batch of data Often.It is therefore desirable to have other external mechanism ensures the strong consistency, for example, by using the Trident of streaming computing platform Storm Function module.
Step S150 preserves the data that processing is completed by way of random writing to the data storage cell 100.
The processing unit 200 will treated write back data to the data storage cell 100, in the present embodiment, The key assignments for data that treated may have occurred that variation compared with the key assignments of the same data read in step s 110. Or in processing procedure, the data read in step s 110 are derived as a plurality of new data and respectively tool by calculation processing There is different key assignments.Therefore, in step S150 by write back data to storage system when, can not ensure the order of key assignments again. Therefore, in the present embodiment, write back data is random without being ordered into.
Based on above-mentioned design, the present embodiment stores data, light-duty calculation processing and calculate between service only by IO or Service call couples, and each module is relatively independent, and can facilitate and linearly carry out behavior extension, such as simply Ground increases calculating/storage resource.For example, when carrying out data dilatation, as long as correspondingly increasing data storage cell 100 The quantity of capacity or partition holding, increase processing unit 200 treatment progress quantity, and increase calculate service unit 400 into Number of passes, and three is independent of each other, and practical operability is very strong, and O&M risk is also relatively low.Therefore, the present embodiment Method have good performance scalability.
In addition, after making batch processed into Stream Processing, can be stored with higher and identical priority to execute data Unit 100, processing unit 200 and the task process for calculating service unit 400.Each data processing task is assigned fixed quota The resources such as memory/cpu/ networks.Then, the resource utilization of entire cluster becomes very smooth, and each data processing is appointed Resource contention between business reduces and (reduces the possibility for entering journey mutually), and cluster resource overall utilization rate has obtained carrying greatly very much It rises.Fig. 6 and Fig. 7 is please referred to, Fig. 6 is the utilization power of cluster resource in prior art batch processed scheme, and Fig. 7 is the present embodiment The utilization power of middle cluster resource, wherein abscissa is the time, and ordinate is the occupancy of computing resource., it is apparent that Cluster resource use becomes smoother in the present embodiment, and the promotion of cluster resource utilization rate is embodied in indirectly to data dilatation Support on.After Stream Processing, under same cluster scale, the scale of supported data processing increase 30% with On.
In conclusion data processing method provided by the invention and system, by data carry out relatively independent storage, Scheduling and processing, and using method batch processed converted for Stream Processing, enhance the data processing system 10 Scalability, and the data processing system 10 is improved when carrying out data processing to the utilization rate of computing resource.
In embodiment provided herein, it should be understood that disclosed device and method, it can also be by other Mode realize.The apparatus embodiments described above are merely exemplary, for example, the flow chart and block diagram in attached drawing are shown The device of multiple embodiments according to the present invention, the architectural framework in the cards of method and computer program product, function And operation.In this regard, each box in flowchart or block diagram can represent one of a module, section or code Point, a part for the module, section or code includes one or more for implementing the specified logical function executable Instruction.It should also be noted that at some as in the realization method replaced, the function of being marked in box can also be attached to be different from The sequence marked in figure occurs.For example, two continuous boxes can essentially be basically executed in parallel, they also may be used sometimes To execute in the opposite order, this is depended on the functions involved.It is also noted that each of block diagram and or flow chart The combination of box in box and block diagram and or flow chart, function or the dedicated of action are based on as defined in execution The system of hardware is realized, or can be realized using a combination of dedicated hardware and computer instructions.
In addition, each function module in each embodiment of the present invention can integrate to form an independent portion Point, can also be modules individualism, can also two or more modules be integrated to form an independent part.
It, can be with if the function is realized and when sold or used as an independent product in the form of software function module It is stored in a computer read/write memory medium.Based on this understanding, technical scheme of the present invention is substantially in other words The part of the part that contributes to existing technology or the technical solution can be expressed in the form of software products, the meter Calculation machine software product is stored in a storage medium, including some instructions are used so that a computer equipment (can be People's computer, server or network equipment etc.) it performs all or part of the steps of the method described in the various embodiments of the present invention. And storage medium above-mentioned includes:USB flash disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), arbitrary access are deposited The various media that can store program code such as reservoir (RAM, Random Access Memory), magnetic disc or CD.
It should be noted that herein, relational terms such as first and second and the like are used merely to a reality Body or operation are distinguished with another entity or operation, are deposited without necessarily requiring or implying between these entities or operation In any actual relationship or order or sequence.Moreover, the terms "include", "comprise" or its any other variant are intended to Non-exclusive inclusion, so that the process, method, article or equipment including a series of elements is not only wanted including those Element, but also include other elements that are not explicitly listed, or further include for this process, method, article or equipment Intrinsic element.In the absence of more restrictions, the element limited by sentence "including a ...", it is not excluded that There is also other identical elements in process, method, article or equipment including the element.
The foregoing is only a preferred embodiment of the present invention, is not intended to restrict the invention, for the skill of this field For art personnel, the invention may be variously modified and varied.All within the spirits and principles of the present invention, any made by repair Change, equivalent replacement, improvement etc., should all be included in the protection scope of the present invention.It should be noted that:Similar label and letter exist Similar terms are indicated in following attached drawing, therefore, once being defined in a certain Xiang Yi attached drawing, are then not required in subsequent attached drawing It is further defined and is explained.
The above description is merely a specific embodiment, but scope of protection of the present invention is not limited thereto, any Those familiar with the art in the technical scope disclosed by the present invention, can easily think of the change or the replacement, and should all contain Lid is within protection scope of the present invention.Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (22)

1. a kind of data processing method is applied to data processing system, which is characterized in that the data processing system includes opposite Independently operated data storage cell, processing unit and calculating service unit, the data storage cell support the sequence of data Reading and random writing;The method includes:
The processing unit reads the pending data in the data storage cell one by one;
According to preset calculating task, judge whether the pending data needs to carry out polymerization calculating, wherein the polymerization meter It includes the calculating for needing that a plurality of pending data is combined to carry out to calculate;
When judging that the pending data need not carry out polymerization calculating, sent out by the processing unit or by the processing data Send to the calculating service unit and commonly calculated, wherein it is described it is common be calculated as it is only for every pending data The vertical calculating one by one carried out;
When judging that the pending data carries out polymerization and calculates, by the pending data be sent to calculating service unit into Row polymerization calculates;
The data that processing is completed are preserved by way of random writing to the data storage cell.
2. according to the method described in claim 1, it is characterized in that, described be sent to calculating service unit by the pending data The step of polymerization calculates is carried out, including:
Pending data is sent to the calculating service unit by the processing unit;
It is local poly- that the calculating service unit calls polymerization calculating service to carry out the pending data in default processing time section It is total to calculate, or global polymerization is carried out to all pending datas and is calculated.
3. according to the method described in claim 2, it is characterized in that, the system also includes polymerization result storage units;It is described The pending data is sent to and calculates the step of service unit carries out polymerization calculating, further includes:
The service unit that calculates preserves the result polymerizeing in calculating process to the polymerization result storage unit, so that counting When calculation service unit is restarted, restore the global polymerization calculating or partial polymerization meter again from the polymerization result storage unit The result of calculation.
4. according to the method described in claim 1, it is characterized in that, described be sent to calculating service unit by the pending data The step of polymerization calculates is carried out, including:
Pending data is sent to the calculating service unit by the processing unit;
The calculating service unit calling polymerization calculates service and to the data received handle one by one in advance, and described in record The results of intermediate calculations for handling acquisition one by one in advance executes batch calculating task to the intermediate meter after pre-set delay duration It calculates result and carries out unified polymerization calculating.
5. according to the method described in claim 1, it is characterized in that, the common calculating include light-duty calculating, heavy calculating and General-purpose computations, the light-duty calculating include the smaller service logic calculation processing of preset calculation amount, and heavy calculate includes The larger calculation processing of preset calculation amount, the general-purpose computations include the general calculation processing of preset different computing tasks; It is described that the step of calculating service unit is commonly calculated, packet are sent to by the processing unit or by the processing data It includes:
When needing to carry out light-duty calculating to the pending data, wait locating to described in the treatment progress of the processing unit Reason data carry out light-duty calculating one by one;
When needing calculate heavy to pending data progress, the heavy type for calculating service unit is called to calculate service pair The pending data carries out heavy calculation processing one by one;
When needing to carry out general-purpose computations to the pending data, the general-purpose computations service pair for calculating service unit is called The pending data carries out general-purpose computations processing one by one.
6. described according to the method described in claim 1, it is characterized in that, the data processing system further includes scheduling unit Processing unit is read the step of pending data in the data storage cell one by one, including:
Pending data in the data storage cell is divided into multiple data fragmentations by the scheduling unit, wherein each The data fragmentation includes at least one pending data;
For each data fragmentation, the scheduling unit be the data fragmentation distribute the processing of a processing unit into Journey is used to handle the pending data in the data fragmentation;
The treatment progress of the processing unit obtains the pending data in assigned data fragmentation one by one.
7. according to the method described in claim 6, it is characterized in that, the data storage cell includes by way of key-value pair The a plurality of data stored correspond to a key assignments per data;The scheduling unit will wait in the data storage cell The step of processing data are divided into multiple data fragmentations, including:
The scheduling unit calculates according to preset data processing total duration and preset data fragmentation handling duration and obtains data Fragment quantity;
The pending data is divided into multiple data fragmentations according to the data fragmentation quantity, wherein each data Fragment corresponds to a range of key values.
8. the method according to the description of claim 7 is characterized in that described for each data fragmentation, the scheduling is single Member is that the data fragmentation distributes the treatment progress of a processing unit for handling the pending data in the data fragmentation The step of, including:
The scheduling unit is chosen according to the working condition of the treatment progress of the processing unit from the data storage cell At least one data fragmentation, the corresponding range of key values of the data fragmentation is sent to the processing unit a free time handle into Journey.
9. the method according to the description of claim 7 is characterized in that described for each data fragmentation, the scheduling is single Member is that the data fragmentation distributes the treatment progress of a processing unit for handling the pending data in the data fragmentation The step of, including:
The scheduling unit obtains request according to the data fragmentation that the idle treatment progress of the processing unit is sent, from the number According at least one data fragmentation is chosen in storage unit, the corresponding range of key values of the data fragmentation is sent to the processing unit An idle treatment progress.
10. method according to claim 8 or claim 9, which is characterized in that the data storage cell includes multiple storages point Area;Described the step of choosing at least one data fragmentation from the data storage cell, including:
The scheduling unit obtains the quantity for the data fragmentation that data processing is being carried out in each partition holding, chooses just In the partition holding for carrying out the data fragmentation minimum number of data processing as target partition holding, and from the target partition holding One data fragmentation of middle selection.
11. according to the method described in claim 6, it is characterized in that, the method further includes:
The processing unit detects whether the data fragmentation handles completion, when data fragmentation processing is completed to the data Fragment is identified.
12. a kind of data processing system, which is characterized in that the data processing system includes the data storage of relatively independent operation Unit, processing unit and calculating service unit, the sequence of data of the data storage cell supporting is read and random writing;It is described Data interaction is carried out to realize following work(by data-interface between data storage cell, processing unit and calculating service unit Energy:
The processing unit reads the pending data in the data storage cell one by one;
According to preset calculating task, judge whether the pending data needs to carry out polymerization calculating, wherein the polymerization meter It includes the calculating for needing that a plurality of pending data is combined to carry out to calculate;
When judging that the pending data need not carry out polymerization calculating, sent out by the processing unit or by the processing data Send to the calculating service unit and commonly calculated, wherein it is described it is common be calculated as it is only for every pending data The vertical calculating one by one carried out;
When judging that the pending data carries out polymerization and calculates, by the pending data be sent to calculating service unit into Row polymerization calculates;
The data that processing is completed are preserved by way of random writing to the data storage cell.
13. system according to claim 12, which is characterized in that the side for calculating service unit and carrying out polymerization calculating Formula, including:
Obtain the pending data that the processing unit is sent;
It calls polymerization to calculate service and partial polymerization calculating is carried out to the pending data preset in processing time section, or to being needed It handles data and carries out global polymerization calculating.
14. system according to claim 13, which is characterized in that the system also includes polymerization result storage units, use In receive and preserve it is described calculating service unit generate polymerization calculating process in as a result, so that calculating service unit restart When, restore the result that the global polymerization calculates or partial polymerization calculates again from the polymerization result storage unit.
15. system according to claim 12, which is characterized in that the side for calculating service unit and carrying out polymerization calculating Formula, including:
Obtain the pending data that the processing unit is sent;
It calls polymerization calculating service to handle the data progress received one by one in advance, and records the processing acquisition one by one in advance Results of intermediate calculations execute batch calculating task after pre-set delay duration and the results of intermediate calculations unified Polymerization calculates.
16. system according to claim 12, which is characterized in that the common calculating includes light-duty calculating, heavy calculating And general-purpose computations, the light-duty calculating include the smaller service logic calculation processing of preset calculation amount, heavy calculate is wrapped The larger calculation processing of preset calculation amount is included, the general-purpose computations include at the general calculating of preset different computing tasks Reason;The processing data are sent to the mode that the calculating service unit is commonly calculated by the processing unit, including:
When needing to carry out light-duty calculating to the pending data, wait locating to described in the treatment progress of the processing unit Reason data carry out light-duty calculating one by one;
When needing calculate heavy to pending data progress, the heavy type for calculating service unit is called to calculate service pair The pending data carries out heavy calculation processing one by one;
When needing to carry out general-purpose computations to the pending data, the general-purpose computations service pair for calculating service unit is called The pending data carries out general-purpose computations processing one by one.
17. system according to claim 12, which is characterized in that the data processing system further includes scheduling unit, is used In the pending data in the data storage cell is divided into multiple data fragmentations, wherein each data fragmentation packet Include at least one pending data;
For each data fragmentation, the scheduling unit be the data fragmentation distribute the processing of a processing unit into Journey is used to handle pending data in the data fragmentation so that the treatment progress of the processing unit obtains one by one it is assigned Pending data in data fragmentation.
18. system according to claim 17, which is characterized in that the data storage cell includes the side by key-value pair The a plurality of data that formula is stored correspond to a key assignments per data;The scheduling unit will be in the data storage cell Pending data is divided into the mode of multiple data fragmentations, including:
The scheduling unit calculates according to preset data processing total duration and preset data fragmentation handling duration and obtains data Fragment quantity;
The pending data is divided into multiple data fragmentations according to the data fragmentation quantity, wherein each data Fragment corresponds to a range of key values.
19. system according to claim 18, which is characterized in that the scheduling unit is the data fragmentation allocation processing The mode of process, including:
The scheduling unit is chosen according to the working condition of the treatment progress of the processing unit from the data storage cell At least one data fragmentation, the corresponding range of key values of the data fragmentation is sent to the processing unit a free time handle into Journey.
20. system according to claim 18, which is characterized in that the scheduling unit is the data fragmentation allocation processing The mode of process, including:
The scheduling unit obtains request according to the data fragmentation that the idle treatment progress of the processing unit is sent, from the number According at least one data fragmentation is chosen in storage unit, the corresponding range of key values of the data fragmentation is sent to the processing unit An idle treatment progress.
21. the system according to claim 19 or 20, which is characterized in that the data storage cell includes multiple storages point Area;The scheduling unit chooses the step of data fragmentation, including:
The scheduling unit obtains the quantity for the data fragmentation that data processing is being carried out in each partition holding, chooses just In the partition holding for carrying out the data fragmentation minimum number of data processing as target partition holding, and from the target partition holding One data fragmentation of middle selection.
22. system according to claim 17, which is characterized in that the processing unit is additionally operable to:
Detect whether the data fragmentation handles completion, when data fragmentation processing is completed to the data fragmentation into rower Know.
CN201710198858.7A 2017-03-29 2017-03-29 Data processing method and system Pending CN108664322A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710198858.7A CN108664322A (en) 2017-03-29 2017-03-29 Data processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710198858.7A CN108664322A (en) 2017-03-29 2017-03-29 Data processing method and system

Publications (1)

Publication Number Publication Date
CN108664322A true CN108664322A (en) 2018-10-16

Family

ID=63786866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710198858.7A Pending CN108664322A (en) 2017-03-29 2017-03-29 Data processing method and system

Country Status (1)

Country Link
CN (1) CN108664322A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259060A (en) * 2020-02-18 2020-06-09 北京百度网讯科技有限公司 Data query method and device
WO2021093461A1 (en) * 2019-11-11 2021-05-20 蚂蚁区块链科技(上海)有限公司 Method and apparatus for aggregation calculation in blockchain-type ledger, and device
CN114398021A (en) * 2022-01-11 2022-04-26 北京大唐神州科技有限公司 Low code delivery method based on software development

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101990238A (en) * 2010-11-05 2011-03-23 中国科学院声学研究所 Method for aggregating sensor network data
CN102521405A (en) * 2011-12-26 2012-06-27 中国科学院计算技术研究所 Massive structured data storage and query methods and systems supporting high-speed loading
CN103281728A (en) * 2013-05-17 2013-09-04 福建星网锐捷网络有限公司 Message aggregation method and device and network equipment
CN104317958A (en) * 2014-11-12 2015-01-28 北京国双科技有限公司 Method and system for processing data in real time
CN104731969A (en) * 2015-04-10 2015-06-24 北京大学深圳研究生院 Mass data join aggregation query method, device and system in distributed environment
CN105183917A (en) * 2015-10-15 2015-12-23 国家电网公司 Multi-dimensional analysis method for multi-level storage data
CN105204920A (en) * 2014-06-18 2015-12-30 阿里巴巴集团控股有限公司 Distributed calculation operation realizing method and device based on mapping and polymerizing
CN105930203A (en) * 2015-12-29 2016-09-07 中国银联股份有限公司 Method and apparatus for controlling message distribution

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101990238A (en) * 2010-11-05 2011-03-23 中国科学院声学研究所 Method for aggregating sensor network data
CN102521405A (en) * 2011-12-26 2012-06-27 中国科学院计算技术研究所 Massive structured data storage and query methods and systems supporting high-speed loading
CN103281728A (en) * 2013-05-17 2013-09-04 福建星网锐捷网络有限公司 Message aggregation method and device and network equipment
CN105204920A (en) * 2014-06-18 2015-12-30 阿里巴巴集团控股有限公司 Distributed calculation operation realizing method and device based on mapping and polymerizing
CN104317958A (en) * 2014-11-12 2015-01-28 北京国双科技有限公司 Method and system for processing data in real time
CN104731969A (en) * 2015-04-10 2015-06-24 北京大学深圳研究生院 Mass data join aggregation query method, device and system in distributed environment
CN105183917A (en) * 2015-10-15 2015-12-23 国家电网公司 Multi-dimensional analysis method for multi-level storage data
CN105930203A (en) * 2015-12-29 2016-09-07 中国银联股份有限公司 Method and apparatus for controlling message distribution

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021093461A1 (en) * 2019-11-11 2021-05-20 蚂蚁区块链科技(上海)有限公司 Method and apparatus for aggregation calculation in blockchain-type ledger, and device
CN111259060A (en) * 2020-02-18 2020-06-09 北京百度网讯科技有限公司 Data query method and device
CN111259060B (en) * 2020-02-18 2023-08-15 北京百度网讯科技有限公司 Data query method and device
CN114398021A (en) * 2022-01-11 2022-04-26 北京大唐神州科技有限公司 Low code delivery method based on software development
CN114398021B (en) * 2022-01-11 2022-09-06 北京大唐神州科技有限公司 Low code delivery method based on software development

Similar Documents

Publication Publication Date Title
CN106407207B (en) Real-time newly-added data updating method and device
US9143562B2 (en) Managing transfer of data from a source to a destination machine cluster
CN109669776B (en) Detection task processing method, device and system
CN104537076B (en) A kind of file read/write method and device
US9886311B2 (en) Job scheduling management
US10356150B1 (en) Automated repartitioning of streaming data
US8843632B2 (en) Allocation of resources between web services in a composite service
US20090300040A1 (en) Table partitioning and storage in a database
CN106713396B (en) Server scheduling method and system
CN106406987A (en) Task execution method and apparatus in cluster
CN106815254A (en) A kind of data processing method and device
CN108399175B (en) Data storage and query method and device
CN103077197A (en) Data storing method and device
CN104407879A (en) A power grid timing sequence large data parallel loading method
CN104504147A (en) Resource coordination method, device and system for database cluster
CN110321364B (en) Transaction data query method, device and terminal of credit card management system
CN110807145A (en) Query engine acquisition method, device and computer-readable storage medium
CN113835823A (en) Resource scheduling method and device, electronic equipment and computer readable storage medium
CN108664322A (en) Data processing method and system
CN110414865A (en) A kind of distribution method, device, computer equipment and the storage medium of audit task
CN108228432A (en) A kind of distributed link tracking, analysis method and server, global scheduler
US20160117107A1 (en) High Performance Hadoop with New Generation Instances
CN107566341A (en) A kind of data persistence storage method and system based on federal distributed file storage system
CN115442262B (en) Resource evaluation method and device, electronic equipment and storage medium
Xiang et al. Optimizing job reliability through contention-free, distributed checkpoint scheduling

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200526

Address after: 310051 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Alibaba (China) Co.,Ltd.

Address before: 510000 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping square B radio tower 13 layer self unit 01 (only for office use)

Applicant before: GUANGZHOU SHENMA MOBILE INFORMATION TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right
RJ01 Rejection of invention patent application after publication

Application publication date: 20181016

RJ01 Rejection of invention patent application after publication