
CN103377251A - File comparison method and device for HDFS (Hadoop Distributed File System) - Google Patents

File comparison method and device for HDFS (Hadoop Distributed File System)

Info

Publication number
CN103377251A
Authority
CN
China
Prior art keywords
file
check value
hdfs
data blocks
crc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210130345XA
Other languages
Chinese (zh)
Other versions
CN103377251B (en)
Inventor
潘瑾瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201210130345.XA
Publication of CN103377251A
Application granted
Publication of CN103377251B
Legal status: Active
Anticipated expiration

Landscapes

  • Error Detection And Correction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a file comparison method and device for HDFS. The file comparison method of an embodiment of the invention comprises the steps of: obtaining information on a first file and a second file from a master node of the HDFS; comparing, according to the information, whether the number of first data blocks constituting the first file is identical to the number of second data blocks constituting the second file; if so, obtaining a plurality of first CRC (Cyclic Redundancy Check) check values of the first data blocks and a plurality of second CRC check values of the second data blocks from slave nodes of the HDFS; comparing the first CRC check values with the second CRC check values one by one in order; if the comparison results are the same, judging that the first file is identical to the second file; and if the comparison results differ, judging that the first file is different from the second file. The method reduces network transmission volume and improves file comparison efficiency.

Description

File comparison method and device for HDFS
Technical field
The present invention relates to the field of Internet technology, and in particular to a file comparison method and device for HDFS.
Background
HDFS (Hadoop Distributed File System) is a distributed file system. It is highly fault-tolerant, provides high-throughput access to application data, and is suitable for applications with very large data sets.
Conventional methods for comparing files on HDFS include:
1. Direct comparison: the two files to be compared are first downloaded from HDFS to the local machine and then compared locally with a file comparison tool such as diff;
2. Hash value comparison: the two files to be compared are first downloaded from HDFS to the local machine, a hash value is then computed for each file, for example with the md5 algorithm, and finally the computed md5 values are compared.
Both methods require downloading the files and processing them byte by byte, so they suffer from a large network transmission volume and low comparison efficiency. The drawback is especially pronounced when large files are compared.
Summary of the invention
The present invention aims to solve at least one of the above technical problems.
To this end, one object of the present invention is to propose a file comparison method for HDFS that saves network transmission and improves comparison efficiency.
Another object of the present invention is to propose a file comparison device for HDFS.
To achieve the above objects, the file comparison method for HDFS according to an embodiment of the first aspect of the invention comprises the following steps: A. obtaining information on a first file and a second file from the master node of HDFS; B. comparing, according to the information, whether the number of first data blocks constituting the first file is the same as the number of second data blocks constituting the second file; C. if so, obtaining a plurality of first CRC check values of the first data blocks and a plurality of second CRC check values of the second data blocks from the slave nodes of HDFS; D. comparing the first CRC check values with the second CRC check values one by one in order; E. if the comparison results are the same, judging that the first file is identical to the second file; F. if the comparison results differ, judging that the first file is different from the second file.
The file comparison method for HDFS according to the embodiment of the invention saves network transmission volume and improves the efficiency of file comparison.
To achieve the above objects, the file comparison device for HDFS according to an embodiment of the second aspect of the invention comprises: an information obtaining module for obtaining information on a first file and a second file from the master node of HDFS; a first comparison module for comparing, according to the information, whether the number of first data blocks constituting the first file is the same as the number of second data blocks constituting the second file; a CRC check value obtaining module for obtaining, when the two numbers are the same, a plurality of first CRC check values of the first data blocks and a plurality of second CRC check values of the second data blocks from the slave nodes of HDFS; a second comparison module for comparing the first CRC check values with the second CRC check values one by one in order; and a judging module for judging that the first file is identical to the second file when the comparison results are the same, and that the first file is different from the second file when the comparison results differ.
The file comparison device for HDFS according to the embodiment of the invention consumes little network transmission and compares files efficiently.
Additional aspects and advantages of the invention are set out in part in the description below; in part they will become apparent from that description or be learned through practice of the invention.
Description of drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a file comparison method for HDFS according to an embodiment of the invention;
Fig. 2 is a flowchart of a file comparison method for HDFS according to an embodiment of the invention;
Fig. 3 is a flowchart of a file comparison method for HDFS according to an embodiment of the invention;
Fig. 4 is a structural block diagram of a file comparison device for HDFS according to an embodiment of the invention;
Fig. 5 is a structural block diagram of a file comparison device for HDFS according to an embodiment of the invention; and
Fig. 6 is a structural block diagram of a file comparison device for HDFS according to an embodiment of the invention.
Detailed description of the embodiments
Embodiments of the invention are described in detail below. Examples of the embodiments are shown in the drawings, in which identical or similar reference numerals denote identical or similar elements, or elements with identical or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting it. On the contrary, the embodiments of the invention cover all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
In the description of the invention, it should be understood that the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. It should also be noted that, unless expressly specified and limited otherwise, the terms "linked" and "connected" are to be understood broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct or indirect through an intermediary. A person of ordinary skill in the art can understand the specific meaning of these terms in the invention according to the circumstances. In addition, unless stated otherwise, "a plurality of" means two or more.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing a specific logical function or a step of the process. The scope of the preferred embodiments of the invention includes other implementations in which functions are performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the invention belong.
To set out the file comparison method and device for HDFS more clearly, HDFS itself is first described further. In an HDFS system, the data of a file is organized as follows: the file is split into a number of fixed-size data blocks, and the data blocks are stored at random on slave nodes (also called DataNodes); the CRC (Cyclic Redundancy Check) check value data of each data block is stored alongside that block; and the information on which data blocks make up a file is stored on the master node (also called the NameNode).
For example, suppose file F is 10G in size. If the data block size set in HDFS is 256M, the file is divided into 40 data blocks. The default cyclic redundancy check method in the system is crc32, so every 512 bytes in a data block corresponds to 4 bytes of CRC check value data, and each data block has 2k of CRC check value data. Thus the master node of HDFS stores the location information of the 40 data blocks that make up file F, and one or more slave nodes of HDFS store the file content of the 40 data blocks together with the 2k of CRC check value data corresponding to each data block.
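The block count in this example follows from a ceiling division of the file size by the block size. A minimal Python sketch of that arithmetic is given here; the function name `block_count` and the use of binary units (1G = 1024M) are assumptions made for illustration and do not appear in the original text.

```python
def block_count(file_size: int, block_size: int) -> int:
    """Number of fixed-size data blocks needed to hold a file.

    The last block may be only partially filled, hence the ceiling division.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return -(-file_size // block_size)  # ceiling division on integers


GIB = 1024 ** 3  # assumed meaning of "G" in the example
MIB = 1024 ** 2  # assumed meaning of "M" in the example

# File F from the example: 10G file, 256M data blocks.
print(block_count(10 * GIB, 256 * MIB))  # 40
```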
The file comparison method for HDFS according to embodiments of the invention is described below with reference to the drawings.
A file comparison method for HDFS comprises the following steps: obtaining information on a first file and a second file from the master node of HDFS; comparing, according to the information, whether the number of first data blocks constituting the first file is the same as the number of second data blocks constituting the second file; if so, obtaining a plurality of first CRC check values of the first data blocks and a plurality of second CRC check values of the second data blocks from the slave nodes of HDFS; comparing the first CRC check values with the second CRC check values one by one in order; if the comparison results are the same, judging that the first file is identical to the second file; and if the comparison results differ, judging that the first file is different from the second file.
Fig. 1 is a flowchart of a file comparison method for HDFS according to an embodiment of the invention.
As shown in Fig. 1, the file comparison method for HDFS according to the embodiment of the invention comprises the following steps.
Step S101: obtain information on the first file and the second file from the master node of HDFS.
Specifically, the information includes the number and locations of the data blocks that make up the first file and the number and locations of the data blocks that make up the second file. In one embodiment of the invention, the first data blocks and the second data blocks are each 256M in size, so the number of data blocks making up a file can be determined once the size of the file is known. Note that the data block size may also be set to other values such as 64M, 128M, or 512M; 256M is chosen here only for convenience of illustration and does not limit the invention.
Step S102: compare, according to the information, whether the number of first data blocks constituting the first file is the same as the number of second data blocks constituting the second file.
Specifically, because the size of each data block in an HDFS system is predetermined, comparing the number of data blocks that make up each file is the first step in judging whether two files are identical. The subsequent judging steps are carried out only when the two files consist of the same number of data blocks.
Step S103: if so, obtain a plurality of first CRC check values of the first data blocks and a plurality of second CRC check values of the second data blocks from the slave nodes of HDFS.
Specifically, once the comparison in step S102 shows that the first file and the second file have the same number of data blocks, the CRC check values of the two files are obtained from the slave nodes of HDFS so that the contents of the first file and the second file can be compared further. The CRC check method has strong error-detection capability and low overhead, and is easy to implement with encoders and detection circuits.
In one embodiment of the invention, each of the first CRC check values and second CRC check values is 2048 bytes long, i.e. 2k. Note that the CRC check value corresponding to each data block may also have another length such as 1k, 4k, or 8k. This length depends on the data block size and on the CRC method selected: the larger the data block, the longer the corresponding CRC check value, and different CRC methods (such as crc12, crc16, crc32) also yield check codes of different lengths. In this embodiment, the length of 2048 bytes is chosen only for convenience of illustration and does not limit the invention.
Step S104: compare the first CRC check values with the second CRC check values one by one in order.
Specifically, the CRC check value of the first data block of the first file is compared with the CRC check value of the first data block of the second file, the CRC check value of the second data block of the first file is compared with the CRC check value of the second data block of the second file, and so on.
Step S105: if the comparison results are the same, judge that the first file is identical to the second file.
Specifically, the first file is judged identical to the second file only if the CRC check values of all data blocks of the first file are the same as the CRC check values of the corresponding data blocks of the second file.
Step S106: if the comparison results differ, judge that the first file is different from the second file.
Specifically, if the CRC check value of any data block of the first file differs from the CRC check value of the corresponding data block of the second file, the first file is judged different from the second file.
In steps S104 to S106, the contents of the first file and the second file are compared in order by comparing the first CRC check values with the second CRC check values one after another. When all first CRC check values are the same as the corresponding second CRC check values, the first file is judged identical to the second file. As soon as a CRC check value of the first file is found to differ from the CRC check value of the second file at the same position, the first file is judged different from the second file, and the remaining CRC check values of the two files need not be compared.
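Steps S101 to S106 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the embodiment: the master node and slave nodes are modelled as plain in-memory dictionaries, and the names `MasterNode`, `SlaveNodes`, and `files_identical` are hypothetical. No real HDFS client API is used.

```python
from typing import Dict, List

# Hypothetical in-memory stand-ins for the HDFS nodes; not a real HDFS client API.
MasterNode = Dict[str, List[str]]  # file path -> ordered list of block IDs
SlaveNodes = Dict[str, bytes]      # block ID -> stored CRC check value data


def files_identical(master: MasterNode, slaves: SlaveNodes,
                    first: str, second: str) -> bool:
    # S101: obtain block information for both files from the master node.
    first_blocks = master[first]
    second_blocks = master[second]

    # S102: if the block counts differ, the files differ; no CRC data is fetched.
    if len(first_blocks) != len(second_blocks):
        return False

    # S103-S104: fetch and compare CRC check values block by block, in order.
    for a, b in zip(first_blocks, second_blocks):
        if slaves[a] != slaves[b]:
            # S106: stop at the first mismatch; later blocks are not compared.
            return False

    # S105: every pair of CRC check values matched.
    return True


master = {"/a": ["b1", "b2"], "/b": ["b3", "b4"], "/c": ["b5"]}
slaves = {"b1": b"\x01\x02", "b2": b"\x03\x04",
          "b3": b"\x01\x02", "b4": b"\x03\x04", "b5": b"\x01\x02"}
print(files_identical(master, slaves, "/a", "/b"))  # True
print(files_identical(master, slaves, "/a", "/c"))  # False
```

Only block lists and check value data pass through the comparison; the file contents themselves are never read.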
The file comparison method for HDFS according to the above embodiment saves network transmission volume and improves the efficiency of file comparison.
For example, the advantage of the invention can be illustrated by comparing two files A and B of 10G each. With the conventional direct comparison method, files A and B must be downloaded from HDFS and then compared byte by byte, so the amount of data transmitted over the network is 10*2=20G and the number of bytes compared is 10G. With the file comparison method of the above embodiment, a 10G file corresponds to 40 data blocks and each data block has a 2k CRC check value, so the amount of data transmitted over the network is only 2*40*2k=160k and the number of bytes compared is only 80k.
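The figures in this example can be reproduced with a few lines of Python. The variable names are illustrative only, binary units are assumed, and the per-block check value length of 2k is taken as stated in the example rather than derived.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
KIB = 1024

file_size = 10 * GIB      # size of each of files A and B
block_size = 256 * MIB    # HDFS data block size used in the example
crc_per_block = 2 * KIB   # per-block check value length, as stated in the example

blocks = file_size // block_size            # 40 data blocks per file
direct_transfer = 2 * file_size             # both files downloaded in full
crc_transfer = 2 * blocks * crc_per_block   # check values of both files
crc_compared = blocks * crc_per_block       # bytes compared pairwise

print(direct_transfer // GIB, crc_transfer // KIB, crc_compared // KIB)  # 20 160 80
```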
Fig. 2 is a flowchart of a file comparison method for HDFS according to an embodiment of the invention.
Step S201: obtain information on the first file and the second file from the master node of HDFS.
Specifically, the information includes the number and locations of the data blocks that make up the first file and the number and locations of the data blocks that make up the second file. In one embodiment of the invention, the first data blocks and the second data blocks are each 256M in size, so the number of data blocks making up a file can be determined once the size of the file is known. Note that the data block size may also be set to other values such as 64M, 128M, or 512M; 256M is chosen here only for convenience of illustration and does not limit the invention.
Step S202: compare, according to the information, whether the number of first data blocks constituting the first file is the same as the number of second data blocks constituting the second file.
Specifically, because the size of each data block in an HDFS system is predetermined, comparing the number of data blocks that make up each file is the first step in judging whether two files are identical. The subsequent judging steps are carried out only when the two files consist of the same number of data blocks.
Step S203: if so, obtain a plurality of first CRC check values of the first data blocks and a plurality of second CRC check values of the second data blocks from the slave nodes of HDFS.
Specifically, once the comparison in step S202 shows that the first file and the second file have the same number of data blocks, the CRC check values of the two files are obtained from the slave nodes of HDFS so that the contents of the first file and the second file can be compared further. The CRC check method has strong error-detection capability and low overhead, and is easy to implement with encoders and detection circuits.
In one embodiment of the invention, each of the first CRC check values and second CRC check values is 2048 bytes long, i.e. 2k. Note that the CRC check value corresponding to each data block may also have another length such as 1k, 4k, or 8k. This length depends on the data block size and on the CRC method selected: the larger the data block, the longer the corresponding CRC check value, and different CRC methods (such as crc12, crc16, crc32) also yield check codes of different lengths. In this embodiment, the length of 2048 bytes is chosen only for convenience of illustration and does not limit the invention.
Step S204: compare the first CRC check values with the second CRC check values one by one in order.
Specifically, the CRC check value of the first data block of the first file is compared with the CRC check value of the first data block of the second file, the CRC check value of the second data block of the first file is compared with the CRC check value of the second data block of the second file, and so on.
Step S205: if the comparison results are the same, judge that the first file is identical to the second file.
Specifically, the first file is judged identical to the second file only if the CRC check values of all data blocks of the first file are the same as the CRC check values of the corresponding data blocks of the second file.
Step S206: if the comparison results differ, judge that the first file is different from the second file.
Specifically, if the CRC check value of any data block of the first file differs from the CRC check value of the corresponding data block of the second file, the first file is judged different from the second file.
In steps S204 to S206, the contents of the first file and the second file are compared in order by comparing the first CRC check values with the second CRC check values one after another. When all first CRC check values are the same as the corresponding second CRC check values, the first file is judged identical to the second file. As soon as a CRC check value of the first file is found to differ from the CRC check value of the second file at the same position, the first file is judged different from the second file, and the remaining CRC check values of the two files need not be compared.
The file comparison method for HDFS according to this embodiment further comprises the following step: in step S202, if the number of first data blocks constituting the first file is found to differ from the number of second data blocks constituting the second file, the first file is judged different from the second file. This additional step means that when the first file and the second file differ in size, the subsequent steps can be skipped: the contents need not be compared, and the two files are judged different directly, which further improves comparison efficiency.
Fig. 3 is a flowchart of a file comparison method for HDFS according to an embodiment of the invention.
As shown in Fig. 3, the file comparison method for HDFS according to the embodiment of the invention comprises the following steps.
Step S301: obtain information on the first file and the second file from the master node of HDFS.
Specifically, the information includes the number and locations of the data blocks that make up the first file and the number and locations of the data blocks that make up the second file. In one embodiment of the invention, the first data blocks and the second data blocks are each 256M in size, so the number of data blocks making up a file can be determined once the size of the file is known. Note that the data block size may also be set to other values such as 64M, 128M, or 512M; 256M is chosen here only for convenience of illustration and does not limit the invention.
Step S302: compare, according to the information, whether the number of first data blocks constituting the first file is the same as the number of second data blocks constituting the second file.
Specifically, because the size of each data block in an HDFS system is predetermined, comparing the number of data blocks that make up each file is the first step in judging whether two files are identical. The subsequent judging steps are carried out only when the two files consist of the same number of data blocks.
Step S303: if so, obtain a plurality of first CRC check values of the first data blocks and a plurality of second CRC check values of the second data blocks from the slave nodes of HDFS.
Specifically, once the comparison in step S302 shows that the first file and the second file have the same number of data blocks, the CRC check values of the two files are obtained from the slave nodes of HDFS so that the contents of the first file and the second file can be compared further. The CRC check method has strong error-detection capability and low overhead, and is easy to implement with encoders and detection circuits. In one embodiment of the invention, each of the first CRC check values and second CRC check values is 2048 bytes long, i.e. 2k. Note that the CRC check value corresponding to each data block may also have another length such as 1k, 4k, or 8k. This length depends on the data block size and on the CRC method selected: the larger the data block, the longer the corresponding CRC check value, and different CRC methods (such as crc12, crc16, crc32) also yield check codes of different lengths. In this embodiment, the length of 2048 bytes is chosen only for convenience of illustration and does not limit the invention.
Step S304: generate a plurality of first hash values corresponding to the first CRC check values and a plurality of second hash values corresponding to the second CRC check values.
Specifically, a hash algorithm maps a binary value of arbitrary length to a smaller binary value of fixed length, and this smaller binary value is called a hash value. A hash value is a unique and extremely compact numerical representation of a piece of data. A hash algorithm is very sensitive to its input: changing even a single letter of the input produces a different hash value. In one embodiment of the invention, each of the first hash values and second hash values is 16 bytes long. Note that there are many hash algorithms, and the hash value length obtained with a different algorithm may also be 32 bytes, 64 bytes, 128 bytes, and so on; 16 bytes is chosen here only for convenience of illustration and does not limit the invention.
Step S305: compare the first hash values with the second hash values one by one in order to judge whether they are the same.
Step S306: if the comparison results are the same, judge that the first file is identical to the second file.
Step S307: if the comparison results differ, judge that the first file is different from the second file.
In steps S305 to S307, the contents of the first file and the second file are compared in order by comparing the first hash values with the second hash values one after another. When all first hash values are the same as the corresponding second hash values, the first file is judged identical to the second file. As soon as a first hash value is found to differ from the second hash value at the same position, the first file is judged different from the second file, and the remaining first and second hash values need not be compared.
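Steps S304 to S307 can be sketched in Python as follows. The sketch assumes MD5 as the hash algorithm because it yields the 16-byte hash values mentioned in this embodiment; the text does not fix the algorithm for this step, and the function names `hash_check_values` and `hashes_identical` are hypothetical.

```python
import hashlib
from typing import List


def hash_check_values(check_values: List[bytes]) -> List[bytes]:
    """S304: map each block's CRC check value data to a 16-byte digest (MD5 assumed)."""
    return [hashlib.md5(cv).digest() for cv in check_values]


def hashes_identical(first: List[bytes], second: List[bytes]) -> bool:
    """S305-S307: compare hash values in order, stopping at the first mismatch."""
    if len(first) != len(second):
        return False
    # all() over a generator short-circuits at the first differing pair.
    return all(a == b for a, b in zip(first, second))


# Made-up 2k check value data for two blocks per file.
a = hash_check_values([b"\x00" * 2048, b"\x01" * 2048])
b = hash_check_values([b"\x00" * 2048, b"\x01" * 2048])
c = hash_check_values([b"\x00" * 2048, b"\x02" * 2048])
print(len(a[0]), hashes_identical(a, b), hashes_identical(a, c))  # 16 True False
```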
In one embodiment of the invention, also further comprise step: among the step S302, if it is not identical with the quantity of a plurality of the second data blocks that consist of the second file to judge the quantity of a plurality of the first data blocks that consist of the first file, judge that then the first file is different with the second file.This step shows: if compare the first file and the not of uniform size of the second file causes, then can need not to carry out subsequent step, need not to continue relatively its particular content, judge that directly the first file is different with the second file, further improve relative efficiency.
According to the file comparison method for HDFS of the above embodiment, the volume of network transmission can be further reduced and the efficiency of file comparison further improved.
For example, the advantage of the invention can be illustrated by comparing two files, A and B, of 10G each. With the traditional hash comparison method, file A and file B must first be downloaded from HDFS, after which a hash value is computed byte by byte for each file and the two results are compared. The amount of data transmitted over the network is therefore 10*2=20G, and the amount of data used to compute the hash values is 20G; the md5 algorithm is used to obtain the hash value of each file, this hash value is 33 bytes, and the two 33-byte hash values are then compared byte by byte. With the file comparison method of the above embodiment of the invention, only 160k of data is transmitted over the network and only 160k of data is used to compute hash values: each 2k check value can be converted into a 16-byte hash value according to the crc32 algorithm, yielding two hash values of 640 bytes each, which are then compared byte by byte.
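The figures in this example follow from the block size, check value length, and hash length given in the embodiments. The arithmetic can be reproduced with the short sketch below, which assumes binary units (1G = 1024M, 1M = 1024k); the variable names are illustrative.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

file_size = 10 * GIB        # each of the two files A and B
block_size = 256 * MIB      # data block size in the embodiment
check_value_size = 2 * KIB  # CRC check value per data block
hash_size = 16              # hash value per check value, in bytes

blocks_per_file = file_size // block_size

# Traditional method: both files are downloaded in full.
traditional_transfer = 2 * file_size

# Method of the embodiment: only the per-block check values are fetched.
check_value_transfer = 2 * blocks_per_file * check_value_size

# Concatenated hash values per file.
hash_bytes_per_file = blocks_per_file * hash_size

print(blocks_per_file)                  # 40
print(traditional_transfer // GIB)      # 20
print(check_value_transfer // KIB)      # 160
print(hash_bytes_per_file)              # 640
```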
The file comparison device for HDFS according to an embodiment of the invention comprises: an information acquisition module for obtaining information on the first file and the second file from the master node of HDFS; a first comparison module for comparing, according to that information, whether the number of first data blocks making up the first file equals the number of second data blocks making up the second file; a CRC check value acquisition module for obtaining, when the two numbers are equal, the first CRC check values of the first data blocks and the second CRC check values of the second data blocks from the slave nodes of HDFS; a second comparison module for comparing the first CRC check values with the second CRC check values one by one, in order; and a judgment module for determining that the first file is identical to the second file when the comparison result is identical, and that the first file is different from the second file when the comparison result is different.
Fig. 4 is a structural block diagram of a file comparison device for HDFS according to an embodiment of the invention.
As shown in Fig. 4, the file comparison device for HDFS according to this embodiment comprises an information acquisition module 1, a first comparison module 2, a CRC check value acquisition module 3, a second comparison module 4, and a judgment module 5.
Specifically, the information acquisition module 1 is used to obtain information on the first file and the second file from the master node of HDFS. This information includes the number and locations of the data blocks making up the first file and the number and locations of the data blocks making up the second file. In one embodiment of the invention, the first data blocks and the second data blocks are each 256M in size, so the number of data blocks making up a file can be derived from the file's size alone. It should be noted that the data block size may also be set to 64M, 128M, 512M, or another value; 256M is chosen here only for ease of illustration and does not limit the invention.
The first comparison module 2 is connected to the information acquisition module 1 and is used to compare, according to the information, whether the number of first data blocks making up the first file equals the number of second data blocks making up the second file.
The CRC check value acquisition module 3 is connected to the first comparison module 2 and is used, when the number of first data blocks making up the first file equals the number of second data blocks making up the second file, to obtain the first CRC check values of the first data blocks and the second CRC check values of the second data blocks from the slave nodes of HDFS. In one embodiment of the invention, each of the first and second CRC check values is 2048 bytes long, i.e. 2k. It should be noted that the CRC check value corresponding to each data block may also be 1k, 4k, 8k, or another length. This length depends on the data block size and on the kind of CRC method chosen: the larger the data block, the longer the corresponding CRC check value, and different CRC methods (such as crc12, crc16, crc32) also yield check codes of different lengths. In this embodiment, the length of 2048 bytes for the first and second CRC check values is chosen only for ease of illustration and does not limit the invention.
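One way to see why the check value grows with the data block is to build it from one CRC per fixed-size chunk of the block. The sketch below does this with CRC32 from Python's standard `zlib` module. This layout, the chunk size, and the function name `block_check_value` are assumptions made for illustration; the text does not specify how the check value of a block is composed.

```python
import zlib


def block_check_value(block: bytes, chunk_size: int) -> bytes:
    """Concatenate one 4-byte CRC32 per chunk of the block.

    Assumed layout for illustration only: a larger block has more chunks
    and therefore a longer check value, and a CRC kind with a different
    width would change the bytes contributed per chunk.
    """
    check_value = bytearray()
    for offset in range(0, len(block), chunk_size):
        crc = zlib.crc32(block[offset:offset + chunk_size])
        check_value += crc.to_bytes(4, "big")
    return bytes(check_value)


small = block_check_value(bytes(4096), chunk_size=512)
large = block_check_value(bytes(8192), chunk_size=512)
print(len(small), len(large))
```

Doubling the block size doubles the number of chunks and hence the length of the check value, in line with the relationship stated above.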
The second comparison module 4 is connected to the CRC check value acquisition module 3 and is used to compare the first CRC check values with the second CRC check values one by one, in order.
The judgment module 5 is connected to the second comparison module 4 and is used to determine that the first file is identical to the second file when the comparison result of the second comparison module 4 is identical, and that the first file is different from the second file when the comparison result of the second comparison module 4 is different.
With the file comparison device for HDFS according to this embodiment of the invention, little network transmission is consumed and file comparison is efficient.
Fig. 5 is a structural block diagram of a file comparison device for HDFS according to an embodiment of the invention.
The file comparison device for HDFS shown in Fig. 5 has the same structure as the device shown in Fig. 4, with one difference: the judgment module 5 in this embodiment is connected not only to the second comparison module 4 but also to the first comparison module 2. Specifically, in this embodiment the judgment module 5 is further used to directly determine that the first file is different from the second file when the number of first data blocks making up the first file differs from the number of second data blocks making up the second file. With the file comparison device for HDFS according to this embodiment, file comparison efficiency is further improved.
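The way the modules of Fig. 5 cooperate can be sketched as a small class. The two callables passed to the constructor stand in for access to the master node and the slave nodes; they, the class name, and the in-memory data are hypothetical and do not correspond to any real HDFS client API.

```python
from typing import Callable, List


class FileComparisonDevice:
    """Illustrative counterpart of modules 1 to 5; not an actual HDFS client."""

    def __init__(self,
                 list_blocks: Callable[[str], List[str]],
                 fetch_check_value: Callable[[str], bytes]) -> None:
        self.list_blocks = list_blocks              # module 1: ask the master node
        self.fetch_check_value = fetch_check_value  # module 3: ask a slave node

    def compare(self, first_path: str, second_path: str) -> bool:
        first_blocks = self.list_blocks(first_path)
        second_blocks = self.list_blocks(second_path)
        # Module 2 feeding module 5 directly: different block counts
        # mean the files are judged different without fetching anything.
        if len(first_blocks) != len(second_blocks):
            return False
        # Modules 3 and 4: fetch and compare check values pair by pair, in order.
        for a, b in zip(first_blocks, second_blocks):
            if self.fetch_check_value(a) != self.fetch_check_value(b):
                return False  # module 5: files are different
        return True           # module 5: files are identical


# In-memory stand-ins for the master node and the slave nodes.
block_map = {"/a": ["a0", "a1"], "/b": ["b0", "b1"],
             "/c": ["c0"], "/d": ["d0", "d1"]}
check_values = {"a0": b"\x01", "a1": b"\x02", "b0": b"\x01", "b1": b"\x02",
                "c0": b"\x01", "d0": b"\x01", "d1": b"\x09"}
device = FileComparisonDevice(block_map.__getitem__, check_values.__getitem__)
print(device.compare("/a", "/b"), device.compare("/a", "/c"), device.compare("/a", "/d"))
```

Fetching the check values lazily inside the loop means that a mismatch on an early block also avoids requesting the check values of the remaining blocks.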
Fig. 6 is a structural block diagram of a file comparison device for HDFS according to an embodiment of the invention.
As shown in Fig. 6, the file comparison device for HDFS according to this embodiment comprises an information acquisition module 1, a first comparison module 2, a CRC check value acquisition module 3, a hash value generation module 6, a second comparison module 4, and a judgment module 5.
Specifically, the information acquisition module 1 is used to obtain information on the first file and the second file from the master node of HDFS. This information includes the number and locations of the data blocks making up the first file and the number and locations of the data blocks making up the second file. In one embodiment of the invention, the first data blocks and the second data blocks are each 256M in size, so the number of data blocks making up a file can be derived from the file's size alone. It should be noted that the data block size may also be set to 64M, 128M, 512M, or another value; 256M is chosen here only for ease of illustration and does not limit the invention.
The first comparison module 2 is connected to the information acquisition module 1 and is used to compare, according to the information, whether the number of first data blocks making up the first file equals the number of second data blocks making up the second file.
The CRC check value acquisition module 3 is connected to the first comparison module 2 and is used, when the number of first data blocks making up the first file equals the number of second data blocks making up the second file, to obtain the first CRC check values of the first data blocks and the second CRC check values of the second data blocks from the slave nodes of HDFS. In one embodiment of the invention, each of the first and second CRC check values is 2048 bytes long, i.e. 2k. It should be noted that the CRC check value corresponding to each data block may also be 1k, 4k, 8k, or another length. This length depends on the data block size and on the kind of CRC method chosen: the larger the data block, the longer the corresponding CRC check value, and different CRC methods (such as crc12, crc16, crc32) also yield check codes of different lengths. In this embodiment, the length of 2048 bytes for the first and second CRC check values is chosen only for ease of illustration and does not limit the invention.
The hash value generation module 6 is connected to the CRC check value acquisition module 3 and is used to generate a plurality of first hash values corresponding to the first CRC check values and a plurality of second hash values corresponding to the second CRC check values. Specifically, a hash algorithm maps a binary value of arbitrary length to a shorter binary value of fixed length; this shorter value is called a hash value. A hash value is a unique and extremely compact numeric representation of a piece of data. Hash algorithms are highly sensitive to the plaintext being hashed: changing even a single letter of the text produces a different hash value. In one embodiment of the invention, each of the first hash values and second hash values is 16 bytes long. It should be noted that there are many hash algorithms, and depending on the algorithm the hash value length may also be 32 bytes, 64 bytes, 128 bytes, or another value; 16 bytes is chosen here only for ease of illustration and does not limit the invention.
The second comparison module 4 is connected to the hash value generation module 6 and is used to compare the first hash values with the second hash values one by one, in order.
The judgment module 5 is connected to the second comparison module 4 and is used to determine that the first file is identical to the second file when the comparison result of the second comparison module 4 is identical, and that the first file is different from the second file when the comparison result of the second comparison module 4 is different. In one embodiment of the invention, the judgment module 5 is also connected to the first comparison module 2 and is further used to directly determine that the first file is different from the second file when the number of first data blocks making up the first file differs from the number of second data blocks making up the second file.
With the file comparison device for HDFS according to this embodiment of the invention, little network transmission is consumed and file comparison is efficient.
It should be understood that the various parts of the invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented with software or firmware that is stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following technologies well known in the art may be used: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field programmable gate arrays (FPGA), and so on.
In this specification, a description referring to the terms "an embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the invention; the scope of the invention is defined by the claims and their equivalents.

Claims (12)

1. A file comparison method for HDFS, characterized by comprising the following steps:
obtaining information on a first file and a second file from the master node of HDFS;
comparing, according to said information, whether the number of first data blocks making up the first file is equal to the number of second data blocks making up the second file;
if so, obtaining a plurality of first CRC check values of said first data blocks and a plurality of second CRC check values of said second data blocks from the slave nodes of HDFS;
comparing said first CRC check values with said second CRC check values one by one, in order;
if the comparison result is identical, determining that said first file is identical to said second file; and
if the comparison result is different, determining that said first file is different from said second file.
2. The file comparison method for HDFS according to claim 1, characterized by further comprising the step of:
generating a plurality of first hash values corresponding to said first CRC check values and a plurality of second hash values corresponding to said second CRC check values, wherein
said first hash values and said second hash values are compared one by one, in order;
if the comparison result is identical, said first file is determined to be identical to said second file; and
if the comparison result is different, said first file is determined to be different from said second file.
3. The file comparison method for HDFS according to claim 1 or 2, characterized by further comprising the step of:
if the number of first data blocks making up the first file is not equal to the number of second data blocks making up the second file, determining that said first file is different from said second file.
4. The file comparison method for HDFS according to claim 1 or 2, characterized in that
said first data blocks and said second data blocks are each 256M in size.
5. The file comparison method for HDFS according to claim 4, characterized in that each of said first CRC check values and said second CRC check values is 2048 bytes long.
6. The file comparison method for HDFS according to claim 5, characterized in that each of said first hash values and said second hash values is 16 bytes long.
7. A file comparison device for HDFS, characterized by comprising:
an information acquisition module, said information acquisition module being used to obtain information on a first file and a second file from the master node of HDFS;
a first comparison module, said first comparison module being used to compare, according to said information, whether the number of first data blocks making up the first file is equal to the number of second data blocks making up the second file;
a CRC check value acquisition module, said CRC check value acquisition module being used, when the number of first data blocks making up the first file is equal to the number of second data blocks making up the second file, to obtain a plurality of first CRC check values of said first data blocks and a plurality of second CRC check values of said second data blocks from the slave nodes of HDFS;
a second comparison module, said second comparison module being used to compare said first CRC check values with said second CRC check values one by one, in order; and
a judgment module, said judgment module being used to determine that said first file is identical to said second file when the comparison result is identical, and that said first file is different from said second file when the comparison result is different.
8. The file comparison device for HDFS according to claim 7, characterized by further comprising:
a hash value generation module, said hash value generation module being used to generate a plurality of first hash values corresponding to said first CRC check values and a plurality of second hash values corresponding to said second CRC check values, wherein
said second comparison module compares said first hash values with said second hash values one by one, in order, and said judgment module determines that said first file is identical to said second file when the comparison result is identical, and that said first file is different from said second file when the comparison result is different.
9. The file comparison device for HDFS according to claim 7 or 8, characterized in that said judgment module is further used to determine that said first file is different from said second file when the number of first data blocks making up the first file is not equal to the number of second data blocks making up the second file.
10. The file comparison device for HDFS according to claim 7 or 8, characterized in that
said first data blocks and said second data blocks are each 256M in size.
11. The file comparison device for HDFS according to claim 10, characterized in that each of said first CRC check values and said second CRC check values is 2048 bytes long.
12. The file comparison device for HDFS according to claim 11, characterized in that each of said first hash values and said second hash values is 16 bytes long.
CN201210130345.XA 2012-04-27 2012-04-27 File comparison method and device for HDFS (Hadoop Distributed File System) Active CN103377251B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210130345.XA CN103377251B (en) 2012-04-27 2012-04-27 File comparison method and device for HDFS (Hadoop Distributed File System)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210130345.XA CN103377251B (en) 2012-04-27 2012-04-27 File comparison method and device for HDFS (Hadoop Distributed File System)

Publications (2)

Publication Number Publication Date
CN103377251A true CN103377251A (en) 2013-10-30
CN103377251B CN103377251B (en) 2017-05-10

Family

ID=49462377

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210130345.XA Active CN103377251B (en) 2012-04-27 2012-04-27 File comparison method and device for HDFS (Hadoop Distributed File System)

Country Status (1)

Country Link
CN (1) CN103377251B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090271447A1 (en) * 2008-04-28 2009-10-29 Shin Kang Soo Method for synchronizing contents file and device for employing the same
CN101290628A (en) * 2008-06-17 2008-10-22 中兴通讯股份有限公司 Data file updating storage method
CN101807207A (en) * 2010-03-22 2010-08-18 北京大用科技有限责任公司 Method for sharing document based on content difference comparison

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104868973A (en) * 2014-02-21 2015-08-26 中国电信股份有限公司 Data integrity verifying method and system
CN104868973B (en) * 2014-02-21 2018-09-11 中国电信股份有限公司 Data integrity verifying method and system
CN104537089A (en) * 2015-01-05 2015-04-22 北京数码大方科技股份有限公司 Method and device for detecting reliability of label data in computer aided design
CN104537089B (en) * 2015-01-05 2018-03-30 北京数码大方科技股份有限公司 Labeled data reliability checking method and device in CAD
CN107451108A (en) * 2017-06-13 2017-12-08 广州视源电子科技股份有限公司 Method and system for collaboratively editing document
CN110460487A (en) * 2019-06-25 2019-11-15 网宿科技股份有限公司 The monitoring method and system of service node, service node
CN110460486A (en) * 2019-06-25 2019-11-15 网宿科技股份有限公司 The monitoring method and system of service node
CN110460486B (en) * 2019-06-25 2022-08-05 网宿科技股份有限公司 Service node monitoring method and system
CN117725026A (en) * 2023-08-14 2024-03-19 荣耀终端有限公司 Repeated file searching method and electronic equipment

Also Published As

Publication number Publication date
CN103377251B (en) 2017-05-10

Similar Documents

Publication Publication Date Title
CN103377251A (en) File comparison method and device for HDFS (Hadoop Distributed File System)
CN107040582B (en) Data processing method and device
CN109889505B (en) Data consistency verification method and terminal equipment
CN111090645B (en) Cloud storage-based data transmission method and device and computer equipment
CN110347651B (en) Cloud storage-based data synchronization method, device, equipment and storage medium
US11424760B2 (en) System and method for data compaction and security with extended functionality
CN101655821B (en) Method and apparatus for settling Hash address conflict when mapping address space
CN110083606A (en) Across chain storage method, terminal and storage medium
CN103226593A (en) File system management method and file storage terminal thereof
KR20160140381A (en) Compressor and Method for Variable-Rate Texture Compression
US20240126436A1 (en) System and method for codebook-based data encoding
CN110633198A (en) Block chain-based software test data storage method and system
CN106685429B (en) Integer compression method and device
CN105553937A (en) System and method for data compression
US12079474B2 (en) System and method for data compaction and encryption of anonymized data records
CN106788891A (en) A kind of optimal partial suitable for distributed storage repairs code constructing method
CN109799948A (en) A kind of date storage method and device
CN104391759A (en) Data archiving method for load sensing in erasure code storage
CN109471642A (en) Firmware generates storage method and device, firmware start method and device
WO2019001436A1 (en) Polar code encoding method and device
CN105007286A (en) Decoding method, decoding device, and cloud storage method and system
CN101572693A (en) Equipment and method for parallel mode matching
CN112711631A (en) Digital twin information synchronization method, system, readable storage medium and device
JP6992309B2 (en) Transmitter, receiver, and communication method
CN106354581B (en) A kind of cyclic redundancy check method and multi-core processor

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant