CN100458796C - File classifying method and file classifier - Google Patents
- Publication number
- CN100458796C (application number CN200710099404A)
- Authority
- CN
- China
- Prior art keywords
- characteristic
- division
- classification
- file
- sort file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A file classification method comprises: setting at least one classification feature and obtaining all combinations of the values of the classification features; analyzing the bit stream of a file to be classified to obtain the combination of classification feature values of that file; and determining the class of the file to be classified according to all the combinations of classification feature values and the combination obtained for the file. A file classifier for implementing the method is also disclosed.
Description
Technical field
The present invention relates to file classification technology, and in particular to a file classification method and a file classifier.
Background technology
Traditional file classification methods usually classify files by extension: the extension serves as the classification feature, and all files with the same extension are placed in one class. This is only a coarse classification, whereas practical applications usually need finer distinctions, so the classes produced by this method often fail to meet practical requirements. Moreover, the method requires every file to be classified to have an extension; a file without an extension cannot be classified at all.
An improved method proposed in response to the above traditional method is as follows: classification levels, and the classification feature corresponding to each level, are formulated according to the classification granularity required by the practical application, and file classification that meets these level and feature requirements is then implemented layer by layer through programming.
Here, a classification feature is an index that characterizes some format property of a file, that is, the file attribute on which classification is based. For example, the extension, the number of channels, and the compression format can all serve as classification features. The values of the "extension" feature may be wave, bmp, mp3, and so on; the values of the "number of channels" feature may be mono and stereo; and the values of the "compression format" feature may be Microsoft pulse code modulation (PCM_MS), Microsoft adaptive differential pulse code modulation (ADPCM_MS), and so on.
With the above method, files whose classification feature values meet the requirements of the application can be separated from the files to be classified. However, the classification levels and the classification features are fixed, so the classification is inflexible and hard to extend. For example, suppose classification of wave files by number of channels has already been implemented. If the files are then to be further classified by compression format or bit width, or classified first by compression format and then by number of channels, a developer must make extensive modifications to the source code. This is because the original source code works as follows: it first picks out the files whose extension is wave according to the extension feature, and then separates out the files whose number of channels meets the requirement according to the number-of-channels feature. Any further classification by compression format or bit width requires corresponding source code to be added. In addition, this method still requires the files to be classified to have extensions.
As the above analysis shows, the classification levels and classification features of existing file classification methods are fixed, so these methods cannot adapt to different application requirements and are hard to extend. Furthermore, they cannot classify files that have no extension, and their classification accuracy is low.
Summary of the invention
In view of this, a main object of the present invention is to provide a file classification method that achieves flexible and accurate file classification.
Another main object of the present invention is to provide a file classifier that achieves flexible and accurate file classification.
To achieve the above objects, the technical solution of the present invention is implemented as follows:
A file classification method comprises the following steps:
setting at least one classification feature, and obtaining all combinations of the values of the classification features;
analyzing the bit stream of a file to be classified to obtain the combination of classification feature values of the file to be classified;
determining the class of the file to be classified according to all the combinations of classification feature values and the combination of classification feature values of the file to be classified.
Further, at least one analysis rule may be set, each analysis rule being used to analyze the bit streams of files to be classified that match the same classification feature.
In that case, analyzing the bit stream of the file to be classified comprises: calling the analysis rules in sequence to analyze the bit stream of the file to be classified.
Further, a call indication may be set, the call indication identifying the analysis rule to be called.
Before the analysis rules are called in sequence, it is further determined whether the call indication has been set. If so, the analysis rule corresponding to the call indication is called to analyze the bit stream of the file to be classified; otherwise, the analysis rules are called in sequence.
Before the analysis rules are called in sequence, it may further be determined whether, among the classification features on which the analysis rules are based, there is one that matches a classification feature of the file to be classified.
If there is, the analysis rule corresponding to the matching classification feature is called to analyze the bit stream of the file to be classified; otherwise, the analysis rules are called in sequence.
After all combinations of classification feature values are obtained, a correspondence between each combination and a class identifier may further be set.
In that case, determining the class of the file to be classified comprises: determining the class identifier of the file according to the correspondence between the combinations and the class identifiers and the combination of classification feature values of the file.
Further, classification levels corresponding to the classification features may be set.
The values in every combination of classification feature values, and the values in the combination obtained for the file to be classified, are arranged in the order of the classification levels.
Further, a classification directory that follows the order of the classification levels may be set for each class.
After the class of the file to be classified is determined, the file is stored in the classification directory corresponding to that class.
A file classifier comprises a classification setting module, a control module, and an analysis module.
The classification setting module is used to set the classification features.
The control module is used to obtain all combinations of classification feature values according to the classification features and to send them to the analysis module.
The analysis module is used to analyze the bit stream of the file to be classified, obtain the combination of classification feature values of that file, determine the class of the file according to all the combinations of classification feature values and the combination obtained for the file, and return the class to the control module.
The analysis module may further comprise at least one analysis unit.
Each analysis unit is used to analyze the bit streams of files to be classified that match the same classification feature; if the analysis succeeds, it returns the class of the file, and otherwise it returns a failure flag.
The control module is used to call the analysis units in sequence until the class of the file to be classified is determined or a failure flag is obtained.
The classification setting module may further be used to set a call indication and send it to the control module; the call indication identifies the analysis rule to be called.
The control module is further used to call the analysis unit corresponding to the call indication.
The file classifier may further comprise a judging module.
The judging module is used to determine whether, among the classification features corresponding to the analysis units, there is one that matches a classification feature of the file to be classified, and if so, to notify the control module to call the analysis unit corresponding to the matching classification feature.
The control module is further used to call the analysis unit corresponding to the matching classification feature according to the notification from the judging module.
The file classifier may further comprise a filing module.
The control module is further used to set the correspondence between all combinations of classification feature values and class identifiers, to set for each class identifier a classification directory that follows the order of the classification levels, and to send the correspondence between class identifiers and classification directories, together with each file whose class identifier has been determined, to the filing module.
The filing module is used to store each file whose class identifier has been determined in the classification directory corresponding to that class identifier.
As the above technical solution shows, the present invention first sets at least one classification feature and obtains all combinations of classification feature values; it then analyzes the bit stream of the file to be classified to obtain the combination of classification feature values of that file; finally, it determines the class of the file according to all the combinations and the combination obtained for the file. Because analyzing the bit stream yields the values of all classification features of the file, and because classification is based on combinations of classification feature values, the source code need not be modified when the classification features change. The method automatically obtains all combinations of values for the new classification features and classifies files against them, thereby achieving flexible and accurate file classification.
Description of drawings
Fig. 1 is an exemplary flowchart of the file classification method of the present invention.
Fig. 2 is a schematic flowchart of the file classification method in embodiment one of the present invention.
Fig. 3 is a schematic flowchart of the file classification method in embodiment two of the present invention.
Fig. 4 is a schematic flowchart of the file classification method in embodiment three of the present invention.
Fig. 5 is a schematic structural diagram of the file classifier of the present invention.
Detailed description of the embodiments
To make the objects, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments.
The main idea of the present invention is as follows. According to the classification levels and the classification feature corresponding to each level, a correspondence is set between class identifiers and all combinations of classification feature values that follow the order of the levels. The value of each classification feature of the file to be classified is obtained by analyzing the bit stream of the file. Finally, the class identifier of the file is determined from the correspondence between the combinations of classification feature values and the class identifiers.
Because analyzing the bit stream yields the values of all classification features of the file, and because classification is based on combinations of classification feature values, the source code need not be modified when the classification levels and classification features change. The method automatically sets the class identifiers for the new levels and features and classifies files accordingly, thereby achieving flexible and accurate file classification.
Fig. 1 is an exemplary flowchart of the file classification method of the present invention. Referring to Fig. 1, the method comprises the following steps:
Step 101: Set at least one classification feature, and obtain all combinations of classification feature values.
Step 102: Analyze the bit stream of the file to be classified to obtain the combination of classification feature values of that file.
Step 103: Determine the class of the file to be classified according to all the combinations of classification feature values and the combination obtained for the file.
This completes the exemplary flow of the file classification method of the present invention.
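The three steps can be sketched in a few lines of Python. This is only an illustrative sketch: the feature table `FEATURES` and the function names are hypothetical, and the bit-stream analysis of step 102 is assumed to have already produced the file's feature values.

```python
from itertools import product

# Hypothetical feature table: feature name -> allowed values (illustrative only).
FEATURES = {
    "encoding": ["PCM_MS", "ADPCM_MS"],
    "channels": [1, 2],
}

def all_combinations(features):
    """Step 101: every combination of feature values, in feature order."""
    return set(product(*features.values()))

def classify(file_values, combinations):
    """Step 103: the class is the matching combination, or None if there is none."""
    return file_values if file_values in combinations else None

combos = all_combinations(FEATURES)
# Step 102 (bit-stream analysis) is assumed to have produced ("PCM_MS", 2).
matched = classify(("PCM_MS", 2), combos)
unmatched = classify(("MP3", 2), combos)
```

Changing the classification features only means changing the contents of `FEATURES`; the two functions stay the same, which is the flexibility the method aims for.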
In the method shown in Fig. 1, the classification features may be specified by the user. A classification feature in the present invention is an index that characterizes some format property of a file; for example, the extension, the encoding format, or some property of a particular kind of file can all serve as classification features. File classification amounts to determining whether a file to be classified matches certain characteristic indexes of an existing file format. Because it is based on existing file formats, and the indexes that characterize an existing file format can be determined from the relevant existing standards, the present invention can list the available classification features for the user to choose from. Moreover, because files can be classified according to all of these characteristic indexes, the classification of the present invention is more accurate.
In the method shown in Fig. 1, there are three preferred ways of analyzing the bit stream of the file to be classified:
First method:
On the basis of the method shown in Fig. 1, at least one analysis rule is preferably constructed. Each analysis rule analyzes the bit streams of files to be classified that match the same classification feature, that is, files that match one particular classification feature. If an analysis rule analyzes the bit stream successfully, it can return the class of the file; if the analysis fails, it can return a failure flag.
In this case, the bit stream of the file to be classified can be analyzed by calling the analysis rules in sequence.
In this preferred scheme, any characteristic index that can serve as a classification feature can also serve as the feature on which the analysis rules are based. For example, when the extension is used as that feature, an analysis rule can be constructed for each extension value, and each rule analyzes the bit streams of files with that extension. For instance, separate analysis rules can be constructed for files with the extensions wave, mp3, and bmp.
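A minimal sketch of calling analysis rules in sequence follows. The two rules here are hypothetical stand-ins that only check file signatures ("RIFF"/"WAVE" for wave files and "BM" for bmp files); a real rule would go on to extract the classification feature values. `None` plays the role of the failure flag.

```python
def analyze_wave(data: bytes):
    # Hypothetical rule: recognises only the RIFF/WAVE signature.
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "WAVE"
    return None  # failure flag

def analyze_bmp(data: bytes):
    # Hypothetical rule: recognises only the "BM" signature of bmp files.
    if data[:2] == b"BM":
        return "BMP"
    return None  # failure flag

RULES = [analyze_wave, analyze_bmp]

def classify_sequentially(data: bytes, rules=RULES):
    """Call each rule in turn; the first rule that succeeds decides the class."""
    for rule in rules:
        result = rule(data)
        if result is not None:
            return result
    return None  # every rule failed
```

Supporting another file type under this sketch means appending one more function to `RULES`; the calling loop is unchanged.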
Second method:
On the basis of the first preferred scheme, a call indication may further be set. The call indication determines whether a particular analysis rule is to be called to analyze the bit stream of the file to be classified, and identifies which analysis rule that is.
Before the analysis rules are called in sequence, if it is determined that the call indication has been set, the analysis rule corresponding to the call indication is called directly to analyze the bit stream, instead of calling every analysis rule in sequence. Only when no call indication has been set are the analysis rules called in sequence.
In this way, when the value of some classification feature of the file to be classified is already known, the time needed for classification is reduced and the efficiency of classification is improved. For example, when a file is known to be a bmp file, that is, when the value of its extension feature is known to be bmp, a call indication can be set so that the bmp analysis rule is called directly to classify the file.
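The effect of a call indication can be sketched as follows. The rule names, the `call_indication` parameter, and the `calls` log are all hypothetical; the log exists only to show which rules actually ran.

```python
calls = []  # records which rules ran, for illustration only

def wave_rule(data: bytes):
    calls.append("wave")
    return "WAVE" if data[:4] == b"RIFF" else None

def bmp_rule(data: bytes):
    calls.append("bmp")
    return "BMP" if data[:2] == b"BM" else None

RULES = {"wave": wave_rule, "bmp": bmp_rule}

def classify(data: bytes, call_indication=None, rules=RULES):
    """Use the indicated rule if a call indication is set, else try all rules."""
    if call_indication is not None:
        return rules[call_indication](data)   # direct call, no sequential scan
    for rule in rules.values():
        result = rule(data)
        if result is not None:
            return result
    return None
```

With `call_indication="bmp"` only the bmp rule runs; without it, the wave rule is tried first and fails before the bmp rule succeeds.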
Third method:
This method can be applied in the first preferred scheme before the analysis rules are called in sequence, or in the second preferred scheme after the check for a call indication and before the analysis rules are called in sequence. Based on the classification features used when the analysis rules were constructed, it is determined whether one of them matches a classification feature of the file to be classified. If so, the analysis rule corresponding to the matching classification feature is called directly to analyze the bit stream, instead of calling every analysis rule in sequence. Only when there is no matching classification feature are the analysis rules called in sequence. This likewise reduces the time needed for classification and improves its efficiency.
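One way to picture the third method, assuming the analysis rules were constructed per extension: if the file's extension matches a registered rule, that rule is called directly, and the sequential scan is used only when nothing matches. The registry and the rule bodies below are hypothetical placeholders.

```python
import os

def wave_rule(data: bytes):
    return "WAVE" if data[:4] == b"RIFF" else None

def bmp_rule(data: bytes):
    return "BMP" if data[:2] == b"BM" else None

# Hypothetical registry: extension value -> analysis rule built for it.
RULES_BY_EXTENSION = {".wav": wave_rule, ".bmp": bmp_rule}

def classify(path: str, data: bytes, rules=RULES_BY_EXTENSION):
    ext = os.path.splitext(path)[1].lower()
    if ext in rules:                  # a matching classification feature exists
        return rules[ext](data)       # call that rule directly
    for rule in rules.values():       # otherwise call the rules in sequence
        result = rule(data)
        if result is not None:
            return result
    return None
```

A file without an extension still gets classified here, because the fallback inspects the bit stream rather than the file name.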
In step 101 of Fig. 1, after all combinations of classification feature values are obtained, a correspondence between each combination and a class identifier may further be set. Then, when the class of the file to be classified is determined, the class identifier of the file can be determined from this correspondence and the file's combination of classification feature values, and this class identifier is used to denote the class of the file.
Preferably, classification levels corresponding to the classification features may be set in step 101. In that case, the values in the combinations obtained in step 101 and the values in the combination determined in step 102 should be arranged in the order of the classification levels.
In the file classification method provided by the present invention, a classification directory that follows the order of the classification levels may further be set for each class or class identifier. Once the class or class identifier of a file is determined, the file can be stored in the corresponding classification directory, which makes the classified files easier to manage and use.
The file classification method of the present invention is described below through three preferred embodiments.
Embodiment one:
This embodiment illustrates the first preferred method above. The extension is used as the classification feature on which the analysis rules are based, and separate analysis rules are set for files whose extension is wave, mp3, bmp, and so on.
Fig. 2 is a schematic flowchart of the file classification method in embodiment one of the present invention. Referring to Fig. 2, the method comprises the following steps:
Step 201: Set the classification levels and the classification feature corresponding to each level, and obtain all combinations of classification feature values.
In this step, once the classification levels and the corresponding classification features are set, all combinations of classification feature values arranged in the order of the classification levels can be obtained.
Taking wave files as an example, suppose the user specifies a three-level classification, that is, the number of classification levels is 3. The classification feature of the first level is the encoding format, that of the second level is the number of channels, and that of the third level is the bit width.
The values of the encoding format feature may be PCM_MS, ADPCM_MS, and so on;
the values of the number-of-channels feature may be mono and stereo;
the values of the bit width feature may be 8, 16, and so on.
Therefore, all combinations of classification feature values arranged in the order of the classification levels can be obtained from the above classification features, as shown in Table 1:
Extension | Encoding format | Number of channels | Bit width |
WAVE | PCM_MS | 1 | 8 |
WAVE | PCM_MS | 1 | 16 |
WAVE | PCM_MS | 2 | 8 |
WAVE | PCM_MS | 2 | 16 |
WAVE | ADPCM_MS | 1 | 8 |
WAVE | ADPCM_MS | 1 | 16 |
WAVE | ADPCM_MS | 2 | 8 |
WAVE | ADPCM_MS | 2 | 16 |
WAVE | … | … | … |
Table 1
For each of the above combinations of classification feature values, a one-to-one class identifier can be set. The class identifier may take any form, for example digits, letters, words, or combinations of these. Preferably, the class identifier has the form "extension_first-level feature value_second-level feature value_third-level feature value". The class identifiers set in this preferred form for the combinations shown in Table 1 are as follows:
WAVE_PCM_MS_1_8
WAVE_PCM_MS_1_16
WAVE_PCM_MS_2_8
WAVE_PCM_MS_2_16
WAVE_ADPCM_MS_1_8
WAVE_ADPCM_MS_1_16
WAVE_ADPCM_MS_2_8
WAVE_ADPCM_MS_2_16
……
In this embodiment, the class identifier set above is denoted FormatID.
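The combinations in Table 1 and their FormatIDs can be generated mechanically. The sketch below assumes a hypothetical `LEVELS` list whose order is the order of the classification levels, and covers only the values that Table 1 lists explicitly.

```python
from itertools import product

# Hypothetical level table; list order = classification level order.
LEVELS = [
    ("encoding", ["PCM_MS", "ADPCM_MS"]),
    ("channels", [1, 2]),
    ("bit_width", [8, 16]),
]

def format_ids(extension, levels):
    """Map each combination of feature values to an identifier of the form
    extension_level1_level2_level3."""
    combos = product(*(values for _, values in levels))
    return {combo: "_".join([extension, *map(str, combo)]) for combo in combos}

ids = format_ids("WAVE", LEVELS)
print(ids[("PCM_MS", 1, 8)])  # WAVE_PCM_MS_1_8
```

Adding a level or a value changes only `LEVELS`; the identifiers follow automatically.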
Step 202: Call the first analysis rule to analyze the file to be classified.
In this step, the analysis rule analyzes the bit stream of the file to be classified according to the existing standard for the file format associated with its classification feature. The following uses the wave file as an example to show how the bit stream of a file to be classified is analyzed.
The wave file is one of the waveform audio file formats used in multimedia, and it is based on the RIFF format. According to the existing RIFF standard, the bit stream of a wave file can be analyzed with the following steps to obtain the classification feature values of a file to be classified:
The 1st step: Read the first 4 bytes from the file bit stream. If these four bytes are "RIFF", continue to the next step; otherwise, return the analysis failure flag.
The 2nd step: Read the next 4 bytes from the file bit stream; these 4 bytes indicate the file length.
The 3rd step: Read the next 4 bytes from the file bit stream. If these four bytes are "WAVE", continue to the next step; otherwise, return the analysis failure flag.
The 4th step: Read the next 4 bytes from the file bit stream and denote this value ckID.
The 5th step: Read the next 4 bytes from the file bit stream and denote this value ckSize.
The 6th step: If ckID is the fmt mark, that is, 0x20746D66, then:
1) read the next 2 bytes, which represent FORMAT_TAG, that is, the "encoding format" classification feature described above, whose value may be WAVE_PCM_MS, WAVE_ADPCM_MS, and so on;
2) read the next 2 bytes, which are the value of the "number of channels" classification feature;
3) skip 10 bytes;
4) read the next 2 bytes, which represent the number of bits per sample, that is, the value of the "bit width" classification feature;
5) the analysis succeeds; return the value of each classification feature and exit.
If ckID is not the fmt mark, skip ckSize bytes and return to the 4th step.
If the end of the file stream is reached during the above analysis before the analysis has succeeded, the analysis failure flag is likewise returned.
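The six steps above map closely onto Python's `struct` module. The following is a sketch rather than the patent's implementation: the function name and the `FMT_TAGS` table are hypothetical, `None` stands in for the analysis failure flag, and the tag values 1 (PCM) and 2 (Microsoft ADPCM) come from the WAVE format definitions rather than from the text above. The sketch also skips the pad byte that RIFF adds after odd-sized chunks, a detail the steps above do not mention.

```python
import io
import struct

# WAVE format tags: 1 = PCM, 2 = Microsoft ADPCM (names follow the text above).
FMT_TAGS = {1: "PCM_MS", 2: "ADPCM_MS"}

def analyze_wave(stream):
    """Return (encoding, channels, bit_width), or None as the failure flag."""
    def read(n):
        data = stream.read(n)
        if len(data) < n:                  # end of stream before success
            raise EOFError
        return data

    try:
        if read(4) != b"RIFF":             # 1st step
            return None
        read(4)                            # 2nd step: file length, unused here
        if read(4) != b"WAVE":             # 3rd step
            return None
        while True:
            ck_id = read(4)                # 4th step
            (ck_size,) = struct.unpack("<I", read(4))   # 5th step
            if ck_id == b"fmt ":           # 6th step: 0x20746D66 little-endian
                tag, channels = struct.unpack("<HH", read(4))
                read(10)                   # sample rate, byte rate, block align
                (bits,) = struct.unpack("<H", read(2))
                return FMT_TAGS.get(tag, tag), channels, bits
            read(ck_size + (ck_size & 1))  # skip chunk plus RIFF pad byte
    except EOFError:
        return None
```

For a typical 16-bit stereo PCM header, `analyze_wave(io.BytesIO(header))` returns `("PCM_MS", 2, 16)`, and a truncated or non-RIFF stream returns `None`.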
Analysis rules set for other classification features analyze the bit stream of the file to be classified in much the same way; the only difference is the standard that each rule follows. For example, the bmp analysis rule follows the bmp file standard, and the mp3 analysis rule follows the mp3 file standard.
To improve classification efficiency, analysis of the bit stream can stop as soon as the values of the classification features set in step 201 have been determined.
When analysis of the bit stream determines the values of the three classification features (encoding format, number of channels, and bit width), and the FormatID of the file is obtained from the correspondence between combinations of classification feature values and class identifiers set in step 201, this FormatID is returned to the caller of the wave analysis rule.
When analysis of the bit stream fails to obtain the value of at least one of these three classification features, or obtains all three values but no FormatID can be found for them in the correspondence set in step 201, a failure flag is returned to the caller of the wave analysis rule.
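The two outcomes described above amount to a table lookup with a failure path. In this sketch the table `FORMAT_IDS` is a hypothetical excerpt of the correspondence set in step 201, and `None` again stands in for the failure flag.

```python
# Hypothetical excerpt of the step-201 correspondence table.
FORMAT_IDS = {
    ("PCM_MS", 1, 8): "WAVE_PCM_MS_1_8",
    ("PCM_MS", 2, 16): "WAVE_PCM_MS_2_16",
}

def lookup_format_id(values, table=FORMAT_IDS):
    """Return the FormatID for the analyzed feature values, or None on failure."""
    if values is None or any(v is None for v in values):
        return None                   # at least one feature value is missing
    return table.get(tuple(values))   # None if the combination is not in the table
```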
Step 203: Determine from the result returned by the called analysis rule whether the analysis succeeded. If it succeeded, continue with step 204; otherwise, continue with step 205.
If the called analysis rule returned a FormatID, the analysis is judged to have succeeded and the filing operation of step 204 is performed; if it returned a failure flag, the analysis is judged to have failed and step 205 is performed.
Step 204: File the successfully analyzed file into its class.
In this embodiment, a classification directory can be set for each class identifier, preferably one that follows the order of the classification levels, that is, a directory of the form "extension/encoding format/number of channels". Taking the FormatIDs set in step 201 as an example, the correspondence between FormatIDs and classification directories set in this preferred way is as follows:
FormatID | Classification directory
---|---
WAVE_PCM_MS_1_8 | WAVE/PCM_MS/Channel1
WAVE_PCM_MS_1_16 | WAVE/PCM_MS/Channel1
WAVE_PCM_MS_2_8 | WAVE/PCM_MS/Channel2
WAVE_PCM_MS_2_16 | WAVE/PCM_MS/Channel2
WAVE_ADPCM_MS_1_8 | WAVE/ADPCM_MS/Channel1
WAVE_ADPCM_MS_1_16 | WAVE/ADPCM_MS/Channel1
WAVE_ADPCM_MS_2_8 | WAVE/ADPCM_MS/Channel2
WAVE_ADPCM_MS_2_16 | WAVE/ADPCM_MS/Channel2
…… | ……
In this step, a file to be classified whose classification identifier has been determined can be stored, according to the above correspondence, in the classification directory corresponding to that identifier. This makes the classified files easier to manage and use. At this point, the classification of a given file in this embodiment is complete, and the flow ends.
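The filing operation of step 204 can be sketched as a lookup followed by a move. The mapping below follows the correspondence listed above; the function names `classification_directory` and `file_into_category` and the `root` parameter are hypothetical names introduced only for illustration.

```python
import shutil
from pathlib import Path

# FormatID -> classification directory, in the form
# "extension/coding format/number of channels".
CLASS_DIRS = {
    "WAVE_PCM_MS_1_8": "WAVE/PCM_MS/Channel1",
    "WAVE_PCM_MS_1_16": "WAVE/PCM_MS/Channel1",
    "WAVE_PCM_MS_2_8": "WAVE/PCM_MS/Channel2",
    "WAVE_PCM_MS_2_16": "WAVE/PCM_MS/Channel2",
    "WAVE_ADPCM_MS_1_8": "WAVE/ADPCM_MS/Channel1",
    "WAVE_ADPCM_MS_1_16": "WAVE/ADPCM_MS/Channel1",
    "WAVE_ADPCM_MS_2_8": "WAVE/ADPCM_MS/Channel2",
    "WAVE_ADPCM_MS_2_16": "WAVE/ADPCM_MS/Channel2",
}


def classification_directory(format_id: str, root) -> Path:
    """Resolve the classification directory for a FormatID under root."""
    return Path(root) / CLASS_DIRS[format_id]


def file_into_category(path, format_id: str, root) -> Path:
    """Move the classified file into its classification directory."""
    target_dir = classification_directory(format_id, root)
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / Path(path).name
    shutil.move(str(path), str(destination))
    return destination
```

Because the bit width is not part of the directory path, the 8-bit and 16-bit variants of a format share one directory, as in the correspondence above.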
Step 205: Judge whether any analysis rule has not yet been called. If so, proceed to step 206; otherwise, proceed to step 207.
In this step, if an uncalled analysis rule remains, the next analysis rule is called in step 206; otherwise, the follow-up operations for a classification failure are performed in step 207.
Step 206: Call the next analysis rule, and return to step 203.
Step 207: Analysis of the file to be classified has failed; perform the follow-up operations for a classification failure.
In this step, a classification failure can be handled in the same way as in the prior art. For example, the file that failed classification can be stored in a specific directory and a corresponding prompt shown to the user. Details are not repeated here.
At this point, the classification of a given file in this embodiment is complete, and the flow ends.
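The control flow of steps 202 to 207 amounts to trying each analysis rule in turn until one succeeds. The following is a minimal sketch, assuming each analysis rule is a function that takes the bit stream and returns either a classification identifier or `None` as the failure flag; the two toy rules are placeholders and do not perform a real format analysis.

```python
from typing import Callable, Optional, Sequence

AnalysisRule = Callable[[bytes], Optional[str]]


def classify_sequentially(data: bytes, rules: Sequence[AnalysisRule]) -> Optional[str]:
    """Call the analysis rules in sequence; return an identifier or None."""
    for rule in rules:             # steps 205/206: take the next uncalled rule
        format_id = rule(data)     # step 202: analyze the bit stream
        if format_id is not None:  # step 203: the analysis succeeded
            return format_id       # the caller proceeds to step 204
    return None                    # step 207: every rule failed


# Toy rules, for illustration only.
def toy_wave_rule(data: bytes) -> Optional[str]:
    return "WAVE" if data[:4] == b"RIFF" and data[8:12] == b"WAVE" else None


def toy_bmp_rule(data: bytes) -> Optional[str]:
    return "BMP" if data[:2] == b"BM" else None
```

A caller would pass the rules in the order they should be tried, for example `classify_sequentially(data, [toy_wave_rule, toy_bmp_rule])`, and handle a `None` result as a classification failure.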
As the above embodiment shows, this embodiment first sets the classification levels and the classification feature corresponding to each level, and obtains the correspondence between all combinations of classification feature values and classification identifiers. It then calls the analysis rules in sequence to analyze the bit stream of the file to be classified and obtain the values of its classification features. Finally, it determines the classification identifier of the file from the correspondence between all combinations of the set classification feature values and the classification identifiers. Because analyzing the bit stream yields the values of all classification features of the file, and because the invention classifies the file on the basis of the combination of those values, no source code needs to be modified when the classification levels and/or classification features change: the method automatically sets the corresponding classification identifiers for the new levels and features and classifies files accordingly, thereby achieving flexible and accurate file classification.
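The point that new levels or features require no source change can be illustrated by generating the correspondence from configuration data. In the sketch below, the `LEVELS` list, the function `build_format_ids` and the underscore-joined identifier format are assumptions chosen to reproduce the FormatID naming used in this document; the feature values are those that appear in the FormatIDs listed earlier.

```python
from itertools import product

# Classification levels in order, each with its classification feature values.
LEVELS = [
    ("coding_format", ["PCM_MS", "ADPCM_MS"]),
    ("channels", [1, 2]),
    ("bit_width", [8, 16]),
]


def build_format_ids(prefix, levels):
    """Map every combination of feature values to a classification identifier."""
    value_lists = [values for _name, values in levels]
    return {
        combo: "_".join([prefix, *map(str, combo)])
        for combo in product(*value_lists)
    }


table = build_format_ids("WAVE", LEVELS)
```

Adding a value to a level, or appending a new level to `LEVELS`, changes only the data passed in; the combinations and identifiers are regenerated without touching the function.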
Embodiment two:
This embodiment describes the second preferred method mentioned above. The extension is again used as the classification feature on which the analysis rules are based, analysis rules are set for files whose extension value is wave, mp3, bmp and so on, and it is assumed to be known in advance that the file to be classified is a wave file.
Fig. 3 is a schematic flowchart of the file classification method in embodiment two of the invention. Referring to Fig. 3, the method comprises the following steps:
Step 301: Set the classification levels and the classification feature corresponding to each level, set a call indication, and obtain all combinations of classification feature values.
In this step, the classification levels, the classification feature corresponding to each level, and the resulting combinations of classification feature values are all the same as in step 201 of embodiment one and are not repeated here. The difference from step 201 of embodiment one is that this step also sets an indication to call the wave analysis rule to analyze the bit stream of the file to be classified.
Step 302: Judge whether a call indication has been set. If so, proceed to step 303; otherwise, call the analysis rules in sequence to analyze the bit stream of the file to be classified, following the operations of steps 202 to 207 of embodiment one, until the category of the file is determined or classification fails. Details are not repeated here.
Step 303: Directly call the analysis rule corresponding to the call indication.
Because this embodiment assumes that the user knows in advance that the file to be classified is a wave file, a call indication can be set in step 301 that indicates calling the analysis rule corresponding to the extension value wave. In this step, the wave analysis rule is called directly, according to that indication, to analyze the file to be classified.
In other respects this step is the same as step 202: if the FormatID of the file to be classified is obtained, the FormatID is returned; otherwise, a failure flag is returned.
Step 304: Judge, from the result returned by the called analysis rule, whether the analysis succeeded. If it succeeded, proceed to step 305; otherwise, proceed to step 306.
If the analysis rule called in step 303 returns a FormatID, the analysis is judged successful and the classification operation of step 305 is performed. If it returns a failure flag, the analysis is judged to have failed and step 306 is performed.
Step 305: File the successfully analyzed file to be classified into its category.
In this step, the successfully analyzed file can be filed in the manner of step 204 of embodiment one, which is not repeated here. At this point, the classification of a given file in this embodiment is complete, and the flow ends.
Step 306: Analysis of the file to be classified has failed; perform the follow-up operations for a classification failure.
In this step, a classification failure can be handled in the same way as in the prior art. For example, the file that failed classification can be stored in a specific directory and a corresponding prompt shown to the user. Details are not repeated here.
At this point, the classification of a given file by this method is complete, and the flow ends.
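The flow of steps 302 to 306 can be sketched as a dispatch on the call indication. The sketch assumes that analysis rules are registered under their extension value and that `None` serves as the failure flag; treating an indication that names no registered rule as a failure is an additional assumption, since the text does not cover that case. The toy rules are placeholders.

```python
def classify_with_indication(data, rules_by_extension, call_indication=None):
    """Call the indicated rule directly, or fall back to calling all rules."""
    if call_indication is not None:            # step 302: an indication is set
        rule = rules_by_extension.get(call_indication)
        if rule is None:
            return None                        # assumption: unknown indication fails
        return rule(data)                      # steps 303 to 306: no other rule is tried
    for rule in rules_by_extension.values():   # no indication: steps 202 to 207
        result = rule(data)
        if result is not None:
            return result
    return None


# Toy rules, for illustration only.
def toy_wave_rule(data):
    return "WAVE" if data[:4] == b"RIFF" and data[8:12] == b"WAVE" else None


def toy_bmp_rule(data):
    return "BMP" if data[:2] == b"BM" else None


RULES = {"bmp": toy_bmp_rule, "wave": toy_wave_rule}
```

With an indication set, a failure of the indicated rule ends the flow at step 306 rather than falling back to the other rules, which is where the time saving comes from.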
As the above embodiment shows, this embodiment first sets the classification levels and the classification feature corresponding to each level, sets a call indication, and obtains the correspondence between all combinations of classification feature values and classification identifiers. It then calls the analysis rule corresponding to the call indication to analyze the bit stream of the file to be classified and obtain the values of its classification features. Finally, it determines the classification identifier of the file from the correspondence between all combinations of the set classification feature values and the classification identifiers. In this way, file classification is not only flexible and accurate but also more efficient, and the time spent on classification is reduced.
Embodiment three:
This embodiment describes the third preferred method mentioned above. The extension is again used as the classification feature on which the analysis rules are based, analysis rules are set for files whose extension value is wave, mp3, bmp and so on, and it is assumed to be known in advance that the file to be classified is a wave file.
Fig. 4 is a schematic flowchart of the file classification method in embodiment three of the invention. Referring to Fig. 4, the method comprises the following steps:
Step 401: Set the classification levels and the classification feature corresponding to each level, and obtain all combinations of classification feature values.
In this step, the classification levels, the classification feature corresponding to each level, and the resulting combinations of classification feature values are all the same as in step 201 of embodiment one and are not repeated here.
Step 402: Judge whether a call indication has been set. If one has been set, directly call the analysis rule corresponding to the call indication to analyze the bit stream of the file to be classified, following the operations of steps 303 to 306 of embodiment two, until the category of the file is determined or classification fails; details are not repeated here. If no call indication has been set, proceed to step 403.
Step 403: Judge whether, among the classification features on which the analysis rules are based, there is a classification feature that matches a classification feature of the file to be classified. If there is, proceed to step 404; otherwise, call the analysis rules in sequence to analyze the bit stream of the file to be classified, following the operations of steps 202 to 207 of embodiment one, until the category of the file is determined or classification fails. Details are not repeated here.
In this step, the judgment is made as follows: the bit stream of the file to be classified is analyzed against the classification features on which the analysis rules are based, to judge whether the bit stream possesses any one of those classification features. If it does, the classification feature that the file possesses is judged to be the matching classification feature.
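One way to realize the check of step 403, when the classification feature is the file type named by the extension, is to look for the leading signature bytes of each type in the bit stream. This is a sketch of that idea and not necessarily the check the patent intends. The function name `matching_feature` is hypothetical, and the signatures used are the commonly documented ones: "RIFF" followed by "WAVE" at offset 8 for wave files, "BM" for bmp files, and either an "ID3" tag or an MPEG frame sync (eleven set bits) for mp3 files.

```python
from typing import Optional


def matching_feature(data: bytes) -> Optional[str]:
    """Return the extension value whose signature the bit stream carries."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wave"
    if data[:2] == b"BM":
        return "bmp"
    # An mp3 stream starts with an ID3v2 tag or with an MPEG frame sync.
    if data[:3] == b"ID3" or (
        len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0
    ):
        return "mp3"
    return None  # no match: fall back to calling the rules in sequence
```

The returned value can then be used to select the analysis rule to call in step 404.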
Step 404: Directly call the analysis rule corresponding to the matching classification feature.
In other respects this step is the same as step 202: if the FormatID of the file to be classified is obtained, the FormatID is returned; otherwise, a failure flag is returned.
Step 405: Judge, from the result returned by the called analysis rule, whether the analysis succeeded. If it succeeded, proceed to step 406; otherwise, proceed to step 407.
If the analysis rule called in step 404 returns a FormatID, the analysis is judged successful and the classification operation of step 406 is performed. If it returns a failure flag, the analysis is judged to have failed and step 407 is performed.
Step 406: File the successfully analyzed file to be classified into its category.
In this step, the successfully analyzed file can be filed in the manner of step 204 of embodiment one, which is not repeated here. At this point, the classification of a given file in this embodiment is complete, and the flow ends.
Step 407: Analysis of the file to be classified has failed; perform the follow-up operations for a classification failure.
In this step, a classification failure can be handled in the same way as in the prior art. For example, the file that failed classification can be stored in a specific directory and a corresponding prompt shown to the user. Details are not repeated here.
At this point, the classification of a given file in this embodiment is complete, and the flow ends.
As the above embodiment shows, this embodiment first sets the classification levels and the classification feature corresponding to each level, and obtains the correspondence between all combinations of classification feature values and classification identifiers. It then directly calls the analysis rule that matches a classification feature of the file to be classified to analyze the bit stream of the file and obtain the value of each of its classification features. Finally, it determines the classification identifier of the file from the correspondence between all combinations of the set classification feature values and the classification identifiers. In this way, file classification is not only flexible and accurate but also more efficient, and the time spent on classification is reduced.
The embodiments of the file classification method of the invention have been described in detail above. An embodiment of the file classifier of the invention is described below.
Fig. 5 is a schematic diagram of the structure of the file classifier of the invention. Referring to Fig. 5, the file classifier comprises a classification setting module 510, a control module 520 and an analysis module 530.
The classification setting module 510 is used to set the classification features and send them to the control module 520.
The control module 520 is used to obtain all combinations of classification feature values from the classification features and send them to the analysis module 530.
The analysis module 530 is used to analyze the bit stream of the file to be classified to obtain the combination of classification feature values of that file, to determine the category of the file from all combinations of classification feature values and the combination obtained for the file, and to return the determined category to the control module 520.
The analysis module 530 of the file classifier shown in Fig. 5 may further comprise at least one analysis unit, denoted analysis unit 1, analysis unit 2, ..., analysis unit n. Each of the analysis units 1 to n is used to analyze the bit streams of files to be classified that share the same classification feature. For example, a bmp analysis unit, a wave analysis unit, an mp3 analysis unit and so on can be set according to the extension.
If an analysis unit analyzes the file to be classified successfully and obtains the corresponding category, it returns the category of the file to the control module 520; otherwise, it returns a failure flag to the control module 520. When the control module 520 directs the analysis module 530 to determine the category of the file to be classified, it calls the analysis units 1 to n in sequence until the category of the file is determined or a failure flag is obtained.
Further, the classification setting module 510 shown in Fig. 5 can be used to set a call indication and send it to the control module 520. The call indication is used to determine which analysis rule is called. In this case, the control module 520 can further be used to call the analysis unit corresponding to the call indication. In this way, when the value of a certain classification feature is known, the time taken by file classification is reduced and its efficiency improved.
To reduce the time taken by file classification and improve its efficiency, the file classifier shown in Fig. 5 may also further comprise a judging module 540. The judging module can be used to judge whether, among the classification features corresponding to the analysis units, there is one that matches a classification feature of the file to be classified and, if there is, to notify the control module 520 to call the analysis unit corresponding to the matching classification feature. In this case, the control module 520 is further used to call that analysis unit according to the notification from the judging module 540.
The file classifier shown in Fig. 5 may further comprise a classification operation module 550.
In a file classifier that comprises the classification operation module 550, the control module 520 is further used to set the correspondence between all combinations of classification feature values and classification identifiers, to set for each classification identifier a classification directory that follows the classification levels, and to send the correspondence between classification identifiers and classification directories, together with the file to be classified whose classification identifier has been determined, to the classification operation module 550. The classification operation module 550 then stores the file in the classification directory corresponding to its classification identifier.
The classification operation module 550 can also be arranged inside the control module 520.
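The division of work among the classification setting module, the control module and the analysis module can be sketched with three small classes. All class and method names below are hypothetical, the analysis unit is a toy that reads two feature values straight from the first two bytes, and the category returned is simply the matched combination of feature values. The optional judging module and classification operation module are omitted.

```python
from itertools import product


class AnalysisModule:
    """Holds analysis units 1..n and the combinations sent by the control module."""

    def __init__(self, units):
        self.units = units          # each unit: bytes -> tuple of values, or None
        self.combinations = set()

    def analyse(self, data):
        for unit in self.units:     # units are called in sequence
            combo = unit(data)
            if combo in self.combinations:
                return combo        # the category is determined
        return None                 # failure flag


class ControlModule:
    def __init__(self, analysis_module):
        self.analysis = analysis_module

    def receive_features(self, features):
        # Derive all combinations of feature values and pass them on.
        self.analysis.combinations = set(product(*features.values()))

    def classify(self, data):
        return self.analysis.analyse(data)


class ClassificationSettingModule:
    def __init__(self, control_module):
        self.control = control_module

    def set_features(self, features):
        self.control.receive_features(features)


def toy_unit(data):
    """Toy analysis unit: byte 0 is the channel count, byte 1 the bit width."""
    return ("PCM_MS", data[0], data[1]) if len(data) >= 2 else None
```

A typical wiring would create the analysis module with its units, hand it to the control module, and let the setting module push the configured features before any file is classified.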
As the above embodiment shows, in the file classifier disclosed by the invention, the classification setting module first sets the classification features. The control module then obtains all combinations of classification feature values from the classification features. Finally, the analysis module analyzes the bit stream of the file to be classified to obtain the combination of classification feature values of that file, determines the category of the file from all combinations of classification feature values and the combination obtained for the file, and returns the determined category to the control module. Because the analysis module obtains the values of all classification features of the file by analyzing its bit stream, and because the invention classifies the file on the basis of the combination of those values, the file classifier needs no change when the classification levels and classification features change: it automatically sets the corresponding classification identifiers for the new levels and features and classifies files accordingly, thereby achieving flexible and accurate file classification.
The above are only preferred embodiments of the invention and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (12)
1. A file classification method, characterized by comprising the following steps:
setting at least one classification feature, and obtaining all combinations of values of the classification feature;
analyzing the bit stream of a file to be classified to obtain the combination of classification feature values of the file to be classified;
determining the category of the file to be classified according to all combinations of the classification feature values and the combination of classification feature values of the file to be classified.
2. The method according to claim 1, characterized in that at least one analysis rule is further set, each analysis rule being used to analyze the bit streams of files to be classified that share the same classification feature;
the analyzing of the bit stream of the file to be classified is: calling the analysis rules in sequence to analyze the bit stream of the file to be classified.
3. The method according to claim 2, characterized in that a call indication is further set, the call indication being used to determine the analysis rule that is called;
before the analysis rules are called in sequence, it is further judged whether the call indication has been set; if it has, the analysis rule corresponding to the call indication is called to analyze the bit stream of the file to be classified; otherwise, the operation of calling the analysis rules in sequence is performed.
4. The method according to claim 2, characterized in that, before the analysis rules are called in sequence, it is further judged whether, among the classification features on which the set analysis rules are based, there is a classification feature that matches a classification feature of the file to be classified;
if there is, the analysis rule corresponding to the matching classification feature is called to analyze the bit stream of the file to be classified; otherwise, the operation of calling the analysis rules in sequence is performed.
5. The method according to any one of claims 1 to 4, characterized in that, after all combinations of the classification feature values are obtained, a correspondence between all combinations of the classification feature values and classification identifiers is further set;
the determining of the category of the file to be classified is: determining the classification identifier of the file to be classified according to the correspondence between all combinations of the classification feature values and the classification identifiers, and the combination of classification feature values of the file to be classified.
6. The method according to any one of claims 1 to 4, characterized in that classification levels corresponding to the classification features are further set;
the order of the values in all combinations of the classification feature values, and the order of the values in the combination of classification feature values of the file to be classified, follow the order of the classification levels.
7. The method according to claim 6, characterized in that a classification directory that follows the order of the classification levels is further set for each category;
after the category of the file to be classified is determined, the file to be classified is further stored in the classification directory corresponding to the category.
8. A file classifier, characterized by comprising: a classification setting module, a control module and an analysis module;
the classification setting module is used to set classification features;
the control module is used to obtain all combinations of classification feature values from the classification features and send them to the analysis module;
the analysis module is used to analyze the bit stream of a file to be classified to obtain the combination of classification feature values of the file to be classified, to determine the category of the file to be classified according to all combinations of the classification feature values and the combination of classification feature values of the file to be classified, and to return the category to the control module.
9. The file classifier according to claim 8, characterized in that the analysis module further comprises at least one analysis unit;
the analysis unit is used to analyze the bit streams of files to be classified that share the same classification feature and, if the analysis succeeds, to return the category of the file to be classified, and otherwise to return a failure flag;
the control module is used to call the analysis units in sequence until the category of the file to be classified is determined or a failure flag is obtained.
10. The file classifier according to claim 9, characterized in that the classification setting module is further used to set a call indication and send the call indication to the control module, the call indication being used to determine the analysis rule that is called;
the control module is further used to call, according to the call indication, the analysis unit corresponding to the call indication.
11. The file classifier according to claim 9, characterized in that the file classifier further comprises a judging module;
the judging module is used to judge whether, among the classification features corresponding to the analysis units, there is a classification feature that matches a classification feature of the file to be classified and, if there is, to notify the control module to call the analysis unit corresponding to the matching classification feature;
the control module is further used to call the analysis unit corresponding to the matching classification feature according to the notification from the judging module.
12. The file classifier according to any one of claims 8 to 11, characterized in that the file classifier further comprises a classification operation module;
the control module is further used to set the correspondence between all combinations of the classification feature values and classification identifiers, to set for each classification identifier a classification directory that follows the order of the classification levels, and to send the correspondence between the classification identifiers and the classification directories, together with the file to be classified whose classification identifier has been determined, to the classification operation module;
the classification operation module is used to store the file to be classified, according to its determined classification identifier, in the classification directory corresponding to that classification identifier.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2007100994040A CN100458796C (en) | 2007-05-18 | 2007-05-18 | File classifying method and file classifier |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2007100994040A CN100458796C (en) | 2007-05-18 | 2007-05-18 | File classifying method and file classifier |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101051322A CN101051322A (en) | 2007-10-10 |
CN100458796C true CN100458796C (en) | 2009-02-04 |
Family
ID=38782734
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB2007100994040A Expired - Fee Related CN100458796C (en) | 2007-05-18 | 2007-05-18 | File classifying method and file classifier |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100458796C (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102646097B (en) * | 2011-02-18 | 2019-04-26 | 腾讯科技(深圳)有限公司 | A kind of clustering method and device |
CN105446705B (en) | 2014-06-30 | 2019-06-21 | 国际商业机器公司 | Method and apparatus for determining the characteristic of configuration file |
CN105868272A (en) * | 2016-03-18 | 2016-08-17 | 乐视网信息技术(北京)股份有限公司 | Multimedia file classification method and apparatus |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1617142A (en) * | 2003-09-29 | 2005-05-18 | 奥林巴斯株式会社 | Information managing method and information managing device |
EP1696340A1 (en) * | 2003-12-15 | 2006-08-30 | Sony Corporation | Information processing apparatus, information processing method, and computer program |
US20060206495A1 (en) * | 2003-04-07 | 2006-09-14 | Johan Sebastiaan Van Gageldonk | Method and apparatus for grouping content items |
Also Published As
Publication number | Publication date |
---|---|
CN101051322A (en) | 2007-10-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C17 | Cessation of patent right | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20090204; Termination date: 20120518 |