CN103310025A - Unstructured-data description method and device

Info

Publication number: CN103310025A
Application number: CN2013102848970A
Authority: CN
Inventors: 鄂海红; 韩晶; 宋美娜; 郑聪; 许可; 毕建鹏; 宋俊德; 黎燕; 于艳华
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2013-07-08
Filing date: 2013-07-08
Publication date: 2013-09-18

Abstract

The invention discloses a method and device for describing unstructured data. The method comprises the following steps: collecting and importing the attribute information of unstructured-data files manually or automatically; generating, from the collected attribute information, JSON (JavaScript Object Notation) files that describe the attributes of the unstructured-data files, and building a data model; and saving the unstructured-data files together with their corresponding JSON files. The method and device are comprehensive and efficient.

Description

Method and device for describing unstructured data

Technical field

The invention belongs to the technical field of databases and retrieval, and specifically relates to a method and device for describing unstructured data.

Background technology

With the arrival of the big-data era, unstructured data accounts for a growing share of all data. A report by IDC (Internet Data Center) shows that by 2012 unstructured data would account for more than 75% of the total data volume of the Internet, of which 50% to 75% is produced by people. Unstructured data not only covers numerous data types, including documents, pictures, HTML, images and audio/video, but is also characterized by marked user personality, varied storage media and varied owning applications. Existing unstructured data is usually stored in a distributed manner on enterprise servers or personal computers, and data owners manage it through file systems, data management systems and the like. In a big-data environment, however, the unstructured nature of the data places new demands on data organization and management methods, and traditional unstructured-data management techniques face new problems. Unstructured-data management includes building the data model, data storage, data cleansing and data application. Building the data model is the foundation of the whole management process, so a well-designed, effective data model lays a sound data-structure foundation for unstructured-data management.

At present, research at home and abroad on unstructured-data models concentrates mainly on structural-information extraction, entity extraction and data-attribute classification. WinFS is a semantic file system that represents unstructured data as <attribute, value> pairs extracted by transcript analysis. Existing schemes include: (1) managing unstructured data by extracting structural information; (2) modeling data, metadata and data associations with a relational model; (3) a tetrahedral data model based on general features, semantic features, low-level features and raw data; (4) a desktop retrieval scheme based on metadata extraction that uses contextual information to improve search results; and (5) a topic model for annotating the content of unstructured data.

As can be seen from the above, existing unstructured-data models focus on the features of the unstructured data itself. However, beyond the features of the data itself, factors such as user characteristics, behavior, the time of the behavior, and the other users or data interacted with are all closely related to the unstructured data that is finally produced. Unstructured data also has extended properties such as field, permissions and person in charge. Because these factors and extended properties are not included in the attribute space of existing unstructured-data models, the data cannot be represented comprehensively and complex retrieval requirements cannot be met. For example, current retrieval of unstructured data is mostly keyword search on the name or content of the data, and the accuracy of the results rarely meets user expectations; moreover, as the data volume grows, retrieval speed is severely constrained. In terms of the data model in particular, existing models cannot meet the requirements that complex retrieval places on the data model, because their description of the data is confined to the basic properties of the data file itself.

Summary of the invention

The present invention aims to solve at least one of the above technical problems at least to some extent, or at least to provide a useful commercial alternative. To this end, one object of the invention is to propose a comprehensive and efficient method for describing unstructured data; another object is to propose a comprehensive and efficient device for describing unstructured data.

The method for describing unstructured data according to an embodiment of the invention comprises the following steps: S1. collecting and importing the attribute information of an unstructured-data file manually or automatically; S2. generating, from the collected attribute information, a JSON file describing the attributes of the unstructured-data file, and performing data modeling; S3. storing the unstructured-data file and the corresponding JSON file.

In one embodiment of the invention, the attribute information comprises: a basic attribute class, comprising a file attribute group, a user attribute group and an authority attribute group; a content attribute class, comprising a description attribute group and a semantic attribute group; a characteristic attribute class, comprising a media attribute group, a document attribute group, an audio attribute group, a video attribute group and an image attribute group; a behavior attribute class, comprising a file heat attribute group, a task attribute group, a context attribute group and an interaction attribute group; an environment attribute class, comprising a topic heat attribute group and a similar-subject attribute group; and a time attribute class, comprising a static time attribute group and a dynamic time attribute group.

In one embodiment of the invention, each attribute group comprises one or more attribute elements.

In one embodiment of the invention, the data modeling comprises tree-structure data modeling, hexahedron-structure data modeling and galaxy-structure data modeling.

In one embodiment of the invention, in the hexahedron-structure data modeling, each face of the hexahedron represents one attribute class, and each attribute of the unstructured data is placed as a node on the face to which it belongs. Attribute elements of the same attribute group are connected in a two-dimensional graph by edges of weight 1, attribute elements of different attribute groups within the same attribute class by edges of weight 2, and attribute elements of different attribute classes by edges of weight 3.

The device for describing unstructured data according to an embodiment of the invention comprises: an attribute information collection module, configured to collect and import the attribute information of unstructured data manually or automatically; an attribute information processing module, connected to the attribute information collection module and configured to generate, from the collected attribute information, a JSON file describing the attributes of the unstructured-data file and to perform data modeling; a storage module, connected to the attribute information processing module and configured to store the unstructured-data file and the corresponding JSON file; and a retrieval module, connected to the storage module and configured to interact with the user, to retrieve according to the retrieval mode selected by the user and the retrieval input defined by the user, and to present the retrieved results to the user for browsing.

In one embodiment of the invention, in the attribute information collection module, the attribute information comprises: a basic attribute class, comprising a file attribute group, a user attribute group and an authority attribute group; a content attribute class, comprising a description attribute group and a semantic attribute group; a characteristic attribute class, comprising a media attribute group, a document attribute group, an audio attribute group, a video attribute group and an image attribute group; a behavior attribute class, comprising a file heat attribute group, a task attribute group, a context attribute group and an interaction attribute group; an environment attribute class, comprising a topic heat attribute group and a similar-subject attribute group; and a time attribute class, comprising a static time attribute group and a dynamic time attribute group.

In one embodiment of the invention, the attribute information processing module performs tree-structure data modeling, hexahedron-structure data modeling and galaxy-structure data modeling on the attribute information.

In one embodiment of the invention, the description device is integrated with the retrieval module and, based on the results of the tree-structure, hexahedron-structure and galaxy-structure data modeling, offers the user unstructured-data retrieval in a basic retrieval mode, a tree retrieval mode, a hexahedron-structure retrieval mode and a galaxy-structure retrieval mode.

In one embodiment of the invention, the basic retrieval mode is implemented on the basis of JAQL queries.

In summary, the method and device for describing unstructured data of the invention are comprehensive and efficient.

Additional aspects and advantages of the invention are given in part in the following description; in part they will become apparent from the description or be learned through practice of the invention.

Description of drawings

The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a flow chart of the method for describing unstructured data according to an embodiment of the invention;

Fig. 2 is a logical schematic of the tree structure of unstructured-data attribute information;

Fig. 3 is a logical schematic of the hexahedron structure of unstructured-data attribute information;

Fig. 4 is a logical schematic of the galaxy structure of unstructured-data attribute information;

Fig. 5 is a structural block diagram of the device for describing unstructured data according to an embodiment of the invention.

Embodiment

Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and are not to be construed as limiting it.

Building on existing research into unstructured-data models, and addressing users' need for complex retrieval, the invention analyzes the behavioral characteristics of the subject that operates on the data, takes into account external factors such as the background in which the data was produced and the field it belongs to, and proposes a method for describing unstructured data.

As shown in Fig. 1, the method for describing unstructured data according to an embodiment of the invention comprises the following steps:

S1. Collect and import the attribute information of the unstructured data manually or automatically.

S2. Generate, from the collected attribute information, a JSON file describing the attributes of the unstructured-data file, and perform data modeling.

S3. Store the unstructured-data file together with the corresponding JSON file.

Optionally, step S3 is followed by a further step: retrieve according to the retrieval mode selected by the user and the retrieval input defined by the user, and present the retrieved results to the user for browsing.
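Steps S1 to S3 can be sketched roughly as follows. This is a minimal illustration and not the implementation of the embodiment: the function names `collect_attributes` and `describe`, the sidecar naming scheme (`<file name>.json` next to the data file) and the rule that manually supplied values override automatically collected ones are all assumptions made for the example. Only a few file attributes are collected automatically here.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def collect_attributes(path: Path, manual: Optional[dict] = None) -> dict:
    """S1: collect some attributes automatically from the file system, then
    merge manually supplied ones (manual values win; an assumption)."""
    attrs = {
        "FileName": path.name,
        "FilePath": str(path),
        "FileType": path.suffix.lstrip("."),
        "Size": path.stat().st_size,  # size in bytes
    }
    attrs.update(manual or {})
    return attrs


def describe(path: Path, manual: Optional[dict] = None) -> Path:
    """S2 + S3: serialise the attributes to JSON and store the description
    next to the data file (hypothetical sidecar naming)."""
    sidecar = path.with_name(path.name + ".json")
    text = json.dumps(collect_attributes(path, manual), indent=2)
    sidecar.write_text(text, encoding="utf-8")
    return sidecar


def demo() -> dict:
    """Describe a throwaway file and read its JSON description back."""
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "report.txt"
        data_file.write_text("hello", encoding="utf-8")
        sidecar = describe(data_file, manual={"Topic": "demo"})
        return json.loads(sidecar.read_text(encoding="utf-8"))
```

Keeping the description in a separate JSON file leaves the data file itself untouched, which matches the requirement in S3 that both be stored side by side.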

In one embodiment of the invention, step S1 in effect introduces user behavior and divides the attribute information related to the unstructured data into several attribute classes; each attribute class comprises at least one attribute group, and each attribute group comprises at least one attribute element, so that the unstructured data is described systematically and comprehensively. To aid understanding by those skilled in the art, several key terms are first defined:

Data object: an unstructured-data file or directory that a user produces or operates on.

Attribute: the objective basic properties that the unstructured data itself possesses, such as file size and file path, together with the information generated when a subject operates on the unstructured data, such as last modification time and access time. These objective properties of the file and the useful information produced during interaction between the subject and the file are collectively called attributes.

Attribute class: some attributes of unstructured data are similar in nature and function; attributes of similar nature and function are grouped into one broad category, called an attribute class.

Unstructured-data model: the whole formed by the unstructured data itself and its six attribute classes.

In a preferred embodiment of the invention, an item of unstructured data comprises a data object and its corresponding attribute information, and the attribute information further comprises six attribute classes: the basic attribute class, content attribute class, characteristic attribute class, behavior attribute class, environment attribute class and time attribute class. Specifically:

(1) Basic attribute class BasicAttrClass: the basic attribute groups of the unstructured data, comprising the file attribute group, the user attribute group and the authority attribute group.

File attribute group FileAttr: intrinsic properties of the unstructured-data file or directory. It can be expressed as the 4-tuple (Size, FilePath, FileType, FileName), denoting the file size, file path, file type and file name respectively.

User attribute group UserAttr: attributes relating the unstructured data to users. It can be expressed as the 4-tuple (User, UserGroup, Company, Program), denoting the user, the user group, the organization the data belongs to, and the application software required to operate on the data.

Authority attribute group AuthorityAttr: attributes describing the operation permissions that different users hold on the unstructured data. It can be expressed as the triple (UserAuthority, GroupAuthority, OtherAuthority), denoting the permissions of the user, of the user group and of other users on the unstructured data.

(2) Content attribute class ContentAttr: the attribute groups related to the content of the unstructured data, comprising the description attribute group and the semantic attribute group.

Description attribute group DescriptionAttr: descriptive information about the content of the unstructured data. It can be expressed as the 4-tuple (Title, Topic, Language, CodingMode), denoting the title of the unstructured data, its topic, the language used in it and its encoding.

Semantic attribute group SemanticAttr: attributes related to the semantic content of the unstructured data. It can be expressed as the triple (Tag, Field, URI), denoting the tags of the unstructured data, the scientific field it belongs to and the uniform resource identifier of the data.

(3) Characteristic attribute class CharacteristicAttr: the attribute groups specific to the media type of the unstructured data, comprising the media attribute group, document attribute group, audio attribute group, image attribute group and video attribute group.

Media attribute group MediaAttr: can be expressed as the 5-tuple (ArtistInvolved, Date, Genre, Album, Length), denoting the artists involved in the creation, the release date, the genre, the album and the media length.

Document attribute group DocAttr: features of document-type data. It can be expressed as the 2-tuple (PageRange, Words), denoting the page range and the word count of the document.

Audio attribute group AudioAttr: features of audio files. It can be expressed as the 2-tuple (BitRate, SampleRate), denoting the bit rate and the sampling rate of the audio file.

Video attribute group VideoAttr: features of video files. It can be expressed as the 4-tuple (FrameRate, TotalBitrate, FrameWidth, FrameHeight), denoting the frame rate, total bit rate, frame width and frame height of the video.

Image attribute group ImageAttr: features of image-type files. It can be expressed as the 5-tuple (Width, Height, HorizontalResolution, VerticalResolution, BitDepth), denoting the width, height, horizontal resolution, vertical resolution and bit depth of the image.

(4) Behavior attribute class BehaviorAttr: the attribute groups related to the subject that produces and operates on the unstructured data, comprising the file heat attribute group, task attribute group, context attribute group and interaction attribute group.

File heat attribute group FileHeatAttr: frequency information about operations on the unstructured-data file itself. It can be expressed as the triple (RecentOperationType, CumulativeTime, CumulativeTimes), denoting the most recent operation type, the cumulative access time and the cumulative number of accesses.

Task attribute group TaskAttr: characteristics related to the task the subject is working on. It can be expressed as the 8-tuple (TaskName, Organizers, ImplePeriod, Phase, ParentTask, Collaborator, Category, Status), denoting the task name, organizers, implementation period, phase, parent task, task collaborators, task category and file status.

Context attribute group ContextAttr: other attributes that have a contextual relationship with the data. It can be expressed as the triple (FileAtSameTime, WebAtSameTime, Location), denoting the other files open at the time of the most recent access, the web resources open at the time of the most recent access, and the geographic location.

Interaction attribute group InterActiveAttr: how the subject interacts with the data in the course of its behavior. It can be expressed as the 6-tuple (SourceWeb, SourceEmail, SourceInsMessage, Purpose, Contact, AssociatedFile), denoting the address of the source web page, the source e-mail, the source instant message, the purpose of the data interaction, the people associated with the data and the associated files.

(5) Environment attribute class EnvironmentAttr: the attribute groups related to the external environment of the unstructured data, comprising the topic heat attribute group and the similar-subject attribute group.

Topic heat attribute group TopicHeatAttr: popularity information about the topic of the unstructured data. It can be expressed as the triple (NumofSearchResult, Numofliterature, NumofTweet), denoting the order of magnitude of search-engine results for the topic, the number of documents on the same topic, and the order of magnitude of mentions of the topic on microblogs.

Similar-subject attribute group SimilarSubjectAttr: popularity information about the subject of the unstructured data. It can be expressed as the 2-tuple (SimilarIndividual, SimilarEnterprise), denoting the individuals that have the same tags and the enterprise users that have the same tags.

(6) Time attribute class TimeAttr: the attribute groups related to the time of the unstructured data, comprising the static time attribute group and the dynamic time attribute group.

Static time attribute group StaticTimeAttr: the time attributes of the unstructured data that never change. It can be expressed as the 1-tuple (CreateTime), denoting the creation time of the unstructured data.

Dynamic time attribute group DynamicTimeAttr: the time attributes of the unstructured data that can change. It can be expressed as the 4-tuple (LastModifyTime, LastSaveTime, LastAccessTime, LastAttrchangeTime), denoting the time the content of the unstructured data was last modified, the time it was last saved, the time it was last accessed and the time its attributes were last changed.
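The six attribute classes, their attribute groups and the attribute elements listed above can be written down as one nested mapping. The sketch below only transcribes the identifiers given in the text; the names `SCHEMA` and `locate` are hypothetical helpers introduced for illustration.

```python
# Attribute class -> attribute group -> attribute elements, as listed above.
SCHEMA = {
    "BasicAttrClass": {
        "FileAttr": ["Size", "FilePath", "FileType", "FileName"],
        "UserAttr": ["User", "UserGroup", "Company", "Program"],
        "AuthorityAttr": ["UserAuthority", "GroupAuthority", "OtherAuthority"],
    },
    "ContentAttr": {
        "DescriptionAttr": ["Title", "Topic", "Language", "CodingMode"],
        "SemanticAttr": ["Tag", "Field", "URI"],
    },
    "CharacteristicAttr": {
        "MediaAttr": ["ArtistInvolved", "Date", "Genre", "Album", "Length"],
        "DocAttr": ["PageRange", "Words"],
        "AudioAttr": ["BitRate", "SampleRate"],
        "VideoAttr": ["FrameRate", "TotalBitrate", "FrameWidth", "FrameHeight"],
        "ImageAttr": ["Width", "Height", "HorizontalResolution",
                      "VerticalResolution", "BitDepth"],
    },
    "BehaviorAttr": {
        "FileHeatAttr": ["RecentOperationType", "CumulativeTime", "CumulativeTimes"],
        "TaskAttr": ["TaskName", "Organizers", "ImplePeriod", "Phase",
                     "ParentTask", "Collaborator", "Category", "Status"],
        "ContextAttr": ["FileAtSameTime", "WebAtSameTime", "Location"],
        "InterActiveAttr": ["SourceWeb", "SourceEmail", "SourceInsMessage",
                            "Purpose", "Contact", "AssociatedFile"],
    },
    "EnvironmentAttr": {
        "TopicHeatAttr": ["NumofSearchResult", "Numofliterature", "NumofTweet"],
        "SimilarSubjectAttr": ["SimilarIndividual", "SimilarEnterprise"],
    },
    "TimeAttr": {
        "StaticTimeAttr": ["CreateTime"],
        "DynamicTimeAttr": ["LastModifyTime", "LastSaveTime",
                            "LastAccessTime", "LastAttrchangeTime"],
    },
}


def locate(element):
    """Return the (attribute class, attribute group) an element belongs to."""
    for attr_class, groups in SCHEMA.items():
        for group, elements in groups.items():
            if element in elements:
                return attr_class, group
    raise KeyError(element)
```

For instance, `locate("Topic")` yields the content attribute class and the description attribute group, which is the lookup a modeling step needs before it can place an attribute in a tree, on a hexahedron face or around a star node.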

In one embodiment of the invention, step S2 generates the attribute description of the unstructured-data file from the collected attribute information as a JSON file. JSON is briefly introduced below.

Unstructured data is massive and real-time. First, unstructured data is produced in real time, and its properties change with time, events, subjects and many other factors, so extensibility is very important when implementing its logical organization. Second, unstructured data is massive, and since the attribute description is an attachment to the data, the less storage it occupies the better. For these two reasons, the invention uses JavaScript Object Notation (JSON) to describe the above model. JSON is a lightweight data-interchange format; being lightweight, extensible and quick to index, it is well suited to describing unstructured data.

Here, environment attributes are constructed as an example for a document, "Prepregnancy physical examination.docx". Each attribute class of the document is described in JSON; taking the environment attributes as an excerpt, the environment attributes of the document are described as follows:
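As a rough illustration of what a JSON description of the environment attribute class could look like for this document, the sketch below builds one in Python and serialises it. It is not the listing of the embodiment: every value is a placeholder invented for the example, and the nesting (class, then group, then element) and the top-level `FileName` key are assumptions. Only the identifiers of the class, groups and elements come from the text.

```python
import json

# All values below are placeholders, not data from the embodiment.
environment_attr = {
    "EnvironmentAttr": {
        "TopicHeatAttr": {
            "NumofSearchResult": 1000000,  # order of magnitude of search hits
            "Numofliterature": 2500,       # documents on the same topic
            "NumofTweet": 40000,           # microblog mentions of the topic
        },
        "SimilarSubjectAttr": {
            "SimilarIndividual": ["user_a", "user_b"],
            "SimilarEnterprise": ["enterprise_x"],
        },
    }
}

# Attach the excerpt to the document it describes and serialise it.
document = {"FileName": "Prepregnancy physical examination.docx", **environment_attr}
text = json.dumps(document, indent=2)
```

Because JSON objects are open-ended, further attribute classes or new attribute elements can be added to `document` later without changing existing entries, which is the extensibility argument made for JSON above.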

In one embodiment of the invention, the data modeling in step S2 comprises tree-structure data modeling, hexahedron-structure data modeling and galaxy-structure data modeling. In essence, this step builds on the description method set out above and manages the attributes of unstructured data with three different logical structures. Specifically:

(1) Tree structure:

Fig. 2 shows the method of modeling unstructured-data attributes with a tree structure: the attributes of the unstructured data are expanded level by level in the form of a tree.

The advantage of managing the attributes of unstructured data with a tree structure is that a tree has a branching hierarchy and tree-based algorithms of all kinds are mature, so managing the attributes of unstructured data is easy to implement.
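The level-by-level expansion can be pictured with nested mappings, where a depth-first walk yields the path from the root to every attribute element. The sketch assumes the tree is stored as nested Python dictionaries (class, group, element); the function name `walk` and the sample values are hypothetical.

```python
def walk(tree, prefix=()):
    """Depth-first expansion: yield (path, value) for every leaf attribute."""
    for key, value in tree.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            # Inner node (attribute class or group): descend one level.
            yield from walk(value, path)
        else:
            # Leaf node: an attribute element with its value.
            yield path, value


# Sample description with made-up values.
description = {
    "BasicAttrClass": {
        "FileAttr": {"FileName": "report.docx", "Size": 20480},
        "UserAttr": {"User": "alice"},
    },
    "TimeAttr": {"StaticTimeAttr": {"CreateTime": "2013-07-08"}},
}

# Map "class/group/element" paths to values.
leaves = {"/".join(path): value for path, value in walk(description)}
```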

(2) hexahedron structure:

Fig. 3 shows the method of modeling unstructured data with a hexahedron structure. Each face of the hexahedron represents one attribute class, and each attribute of the unstructured data is placed as a node on the face to which it belongs. Attribute elements of the same attribute group are connected in a two-dimensional graph by edges of weight 1, attribute elements of different attribute groups within the same attribute class by edges of weight 2, and attribute elements of different attribute classes by edges of weight 3.

For example, file size and file path are connected by an edge of weight 1, file size and user by an edge of weight 2, and file size and topic by an edge of weight 3.
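The weighting rule can be expressed as a small function. The sketch below assumes a lookup table from attribute element to its (attribute class, attribute group); `GROUP_OF` covers only the four elements used in the example, and both it and `edge_weight` are hypothetical names.

```python
# Attribute element -> (attribute class, attribute group); small subset only.
GROUP_OF = {
    "Size": ("BasicAttrClass", "FileAttr"),
    "FilePath": ("BasicAttrClass", "FileAttr"),
    "User": ("BasicAttrClass", "UserAttr"),
    "Topic": ("ContentAttr", "DescriptionAttr"),
}


def edge_weight(a, b):
    """Weight of the edge between two attribute elements:
    1 within a group, 2 across groups of one class, 3 across classes."""
    class_a, group_a = GROUP_OF[a]
    class_b, group_b = GROUP_OF[b]
    if class_a != class_b:
        return 3
    if group_a != group_b:
        return 2
    return 1
```

With this table, `edge_weight("Size", "FilePath")` is 1, `edge_weight("Size", "User")` is 2 and `edge_weight("Size", "Topic")` is 3, matching the three cases in the example.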

The advantage of managing the attributes of unstructured data with a hexahedron structure is that it provides a retrieval mode that proceeds step by step. Adding search conditions one at a time forms a search path on a weighted graph. Each search path can be regarded as a weighted doubly linked list: the search can be refined further or can follow the links back to the previous step, and every node can serve as the head node. Linking step by step through this weighted doubly linked list realizes progressive information retrieval. Algorithms on weighted graphs are also mature; for instance, when a user supplies only a first-level search condition (such as a file name) and wants to see the results as soon as possible, the system can present the results to the user by means of a shortest-path algorithm.
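The step-by-step refinement with the option of stepping back can be sketched as follows. This is a simplification under stated assumptions: a Python list used as a stack stands in for the doubly linked list, records are flat dictionaries, conditions are exact matches, and edge weights and the shortest-path step are not modeled. The class name `SearchPath` and the sample records are hypothetical.

```python
class SearchPath:
    """A search path built one condition at a time, with step-back."""

    def __init__(self, records):
        self._records = list(records)
        self._steps = []  # (attribute, value) conditions, in the order added

    def refine(self, attribute, value):
        """Add one search condition and return the narrowed results."""
        self._steps.append((attribute, value))
        return self.results()

    def back(self):
        """Drop the most recent condition and return the previous results."""
        if self._steps:
            self._steps.pop()
        return self.results()

    def results(self):
        """Records that satisfy every condition currently on the path."""
        return [r for r in self._records
                if all(r.get(a) == v for a, v in self._steps)]


# Made-up records for illustration.
records = [
    {"FileName": "a.mp4", "FileType": "mp4", "Topic": "big data"},
    {"FileName": "b.docx", "FileType": "docx", "Topic": "big data"},
    {"FileName": "c.mp4", "FileType": "mp4", "Topic": "health"},
]
path = SearchPath(records)
by_topic = path.refine("Topic", "big data")
by_topic_and_type = path.refine("FileType", "mp4")
after_back = path.back()
```

Each call to `refine` corresponds to following one more link along the path, and `back` corresponds to following the link to the previous step.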

(3) galactic structure:

Fig. 4 shows the method of modeling unstructured data with a galaxy structure.

The advantage of managing unstructured-data attributes with a galaxy structure is that the attributes of each item of unstructured data can be regarded as star-shaped nodes radiating from the unstructured data itself at the center. To make related data easy to retrieve, the user can set pointers that connect data objects or attributes that are strongly linked, which makes related files quick to discover.
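A minimal sketch of the galaxy structure, assuming each data object is a "star" whose attributes radiate from it and whose user-set pointers are stored under a label. The class name `Star`, the choice to make pointers bidirectional and the sample film data are assumptions for illustration.

```python
class Star:
    """A data object at the centre, with its attributes radiating from it."""

    def __init__(self, name, attributes=None):
        self.name = name
        self.attributes = dict(attributes or {})
        self.pointers = {}  # label -> list of linked Star objects

    def link(self, label, other):
        """Set a user-defined pointer between two strongly related objects
        (stored in both directions; an assumption of this sketch)."""
        self.pointers.setdefault(label, []).append(other)
        other.pointers.setdefault(label, []).append(self)

    def related(self, label):
        """Names of the objects reachable through pointers with this label."""
        return [star.name for star in self.pointers.get(label, [])]


# Two films with made-up release dates, linked because they share the period.
film_a = Star("Film A", {"Date": "2013-07"})
film_b = Star("Film B", {"Date": "2013-07"})
film_a.link("same release period", film_b)
```

Following a pointer is a direct lookup on the star node, so related objects are found without scanning all descriptions.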

As can be seen from the above, once unstructured data has been described with the description method of the embodiment of the invention, it is available for user retrieval.

In one embodiment of the invention, the description device is integrated with the retrieval module and, based on the results of the tree-structure, hexahedron-structure and galaxy-structure data modeling, offers the user unstructured-data retrieval in a basic retrieval mode, a tree retrieval mode, a hexahedron-structure retrieval mode and a galaxy-structure retrieval mode. The basic retrieval mode is implemented on the basis of JAQL queries.

Specifically, retrieval is performed in the mode selected by the user (basic, tree, hexahedron-structure or galaxy-structure) and according to the user-defined retrieval input (that is, keywords), and the retrieved results are presented to the user for browsing. To aid understanding by those skilled in the art, the four retrieval modes are illustrated below:

(1) basic retrieval mode example

The basic retrieval mode is built on basic JAQL queries. JAQL is a JSON-oriented data query language (although it supports more than JSON) that IBM donated to the open-source community; it can process both structured and unstructured data sets. Through core expressions such as filter, join and group by, JAQL can operate on JSON data sets. More importantly, JAQL can operate directly on data stored in HDFS and, to achieve parallelism, can where appropriate rewrite a high-level query into a "low-level" query composed of MapReduce jobs. The JAQL engine internally converts queries into Map and Reduce tasks, which can significantly shorten development time for applications that analyze massive data in Hadoop. The present description therefore uses JSON to represent and store the unstructured-data model and JAQL to express query operations, ultimately enabling complex retrieval over massive unstructured big data.

Taking the big-data research in the "networked data resource management" project as an example, Undata.json is the JSON file that describes the unstructured data.

Retrieve "all data of the big-data management project": match the task name (TaskName) attribute and the parent task (ParentTask) attribute, and select the data type, file name and path for display. The retrieval results show that the qualifying data comprises one text document and one video. Retrieval: "all data of the big-data management project"

When a big-data service built on the above logical structures receives retrieval, analysis and visualization requests, the requests can be converted into JAQL statements. The resulting narrowing of the search range allows the big-data service to identify and locate the target unstructured data quickly.
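The logic of this retrieval can be illustrated in Python. The sketch is a stand-in for the filter-and-project steps of such a query and is not JAQL. The contents of `UNDATA_JSON`, the flat record layout, the project name string and the function name `query` are all assumptions invented for the example; only the attribute names TaskName, ParentTask, FileType, FileName and FilePath come from the text.

```python
import json

# Hypothetical contents for Undata.json; real descriptions would be richer.
UNDATA_JSON = """
[
  {"FileName": "proposal.docx", "FileType": "docx", "FilePath": "/data/proposal.docx",
   "TaskName": "big data management", "ParentTask": "networked data resource management"},
  {"FileName": "kickoff.mp4", "FileType": "mp4", "FilePath": "/data/kickoff.mp4",
   "TaskName": "kickoff recording", "ParentTask": "big data management"},
  {"FileName": "budget.xlsx", "FileType": "xlsx", "FilePath": "/data/budget.xlsx",
   "TaskName": "finance", "ParentTask": "administration"}
]
"""


def query(records, project):
    """Keep records whose TaskName or ParentTask equals the project name,
    then project each match onto data type, file name and path."""
    return [
        {key: record[key] for key in ("FileType", "FileName", "FilePath")}
        for record in records
        if project in (record.get("TaskName"), record.get("ParentTask"))
    ]


records = json.loads(UNDATA_JSON)
result = query(records, "big data management")
```

With this sample data the result holds one text document and one video, mirroring the outcome described above; the third record matches neither attribute and is dropped.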

(2) tree search mode example

A. The unstructured-data retrieval device prompts the user with the available retrieval modes (basic retrieval mode, tree retrieval mode, regular-hexahedron retrieval mode and galaxy retrieval mode) and briefly explains each;

B. The user knows only the information of one attribute class of the unstructured data; for example, the user knows only its basic attribute class information;

C. The user selects the tree retrieval mode;

D. The unstructured-data retrieval device provides a detailed description and explanation of each attribute classification of the tree retrieval mode and asks the user to confirm whether to use tree retrieval;

E. The user compares the information he or she has with the information given by the device, and either confirms the tree query or returns to the previous menu;

F. The user inputs the information he or she has that belongs to the basic attribute class into the unstructured-data retrieval device;

G. The unstructured-data retrieval device converts the user's input into a query statement and returns the retrieval results.
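Steps F and G can be sketched as a search restricted to one attribute class subtree. The sketch assumes descriptions are nested dictionaries (class, group, element), that the user's input is a mapping from attribute element to known value, and that matching is exact; the function name `tree_search` and the sample data are hypothetical.

```python
def tree_search(descriptions, attr_class, known):
    """Return the descriptions whose `attr_class` subtree matches every
    attribute element the user knows (exact match; an assumption)."""
    hits = []
    for description in descriptions:
        groups = description.get(attr_class, {})
        # Flatten the groups of this one class into element -> value.
        elements = {k: v for group in groups.values() for k, v in group.items()}
        if all(elements.get(k) == v for k, v in known.items()):
            hits.append(description)
    return hits


# Made-up descriptions for illustration.
descriptions = [
    {"BasicAttrClass": {"FileAttr": {"FileName": "plan.docx", "FileType": "docx"},
                        "UserAttr": {"User": "alice"}}},
    {"BasicAttrClass": {"FileAttr": {"FileName": "demo.mp4", "FileType": "mp4"},
                        "UserAttr": {"User": "alice"}}},
]
found = tree_search(descriptions, "BasicAttrClass",
                    {"User": "alice", "FileType": "docx"})
```

Restricting the search to a single branch of the tree means only the subtree the user actually has information about is examined.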

(3) hexahedron structure retrieval mode example

A. The unstructured-data retrieval device prompts the user with the available retrieval modes (basic retrieval mode, tree retrieval mode, regular-hexahedron retrieval mode and galaxy retrieval mode) and briefly explains each;

B. The user knows only some attributes of the unstructured data; for example, the user knows only the file name of the unstructured data, that it is a video file, and the topic of the file;

C. The user selects the regular-hexahedron retrieval mode;

D. The unstructured-data retrieval device provides a detailed description and explanation of each attribute classification of the regular-hexahedron retrieval mode and asks the user to confirm whether to use regular-hexahedron retrieval;

E. The user compares the information he or she has with the information given by the device, and either confirms the regular-hexahedron query or returns to the previous menu;

F. The user inputs the attribute information he or she has into the unstructured-data retrieval device;

(4) Galactic structure retrieval mode example

B. The user knows only some of the attributes of the unstructured data and wishes to retrieve, as quickly as possible, all files that are highly relevant to it; for example, the user knows only the title of a certain film and also wants to know which other films were released in the same period;

C. The user selects the galaxy retrieval mode;

D. The unstructured-data retrieval device provides a detailed description and explanation of each attribute classification used in the galaxy retrieval mode, and prompts the user to confirm whether to use galaxy retrieval;

E. The user compares the information at hand with the information given by the device, and either confirms the galactic structure query or returns to the previous menu;

G. The unstructured-data retrieval device converts the user's input into a query statement; through the pointers set between the release-date attributes of films released in the same period, it can quickly retrieve both the film the user is looking for and the films released in the same period.
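The pointer mechanism of step G can be sketched as follows. This is a hedged illustration only: the film records, the attribute name `release_period` and the functions `build_period_links` and `galaxy_search` are hypothetical, and the links are modeled as a simple name-to-names dictionary rather than as the pointer structure actually used by the invention.

```python
from collections import defaultdict

# Hypothetical film descriptors; only the two attributes used here are shown.
FILMS = [
    {"name": "Film A", "release_period": "2013-07"},
    {"name": "Film B", "release_period": "2013-07"},
    {"name": "Film C", "release_period": "2012-12"},
]


def build_period_links(films):
    """Link each film to the other films that share its release period."""
    by_period = defaultdict(list)
    for film in films:
        by_period[film["release_period"]].append(film["name"])
    return {
        film["name"]: [n for n in by_period[film["release_period"]]
                       if n != film["name"]]
        for film in films
    }


LINKS = build_period_links(FILMS)


def galaxy_search(name):
    """Return the named film followed by the films linked to it."""
    if name not in LINKS:
        return []
    return [name] + LINKS[name]


print(galaxy_search("Film A"))  # prints ['Film A', 'Film B']
```

Because the links are built once, at modeling time, a query only follows them and does not rescan every descriptor.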

In summary, with regard to data model construction for big data services, and to address the lack of a general unstructured data model suitable for such services, this document proposes three logical-structure modeling methods (tree, regular hexahedron and galactic structure) and designs an unstructured data model that covers comprehensive data characteristics such as user behavior and semantic background. The example verification results show that models designed on the basis of the three logical structures have good generality and comprehensiveness, and that the lightweight JSON description together with JAQL querying provides a solid foundation for implementing unstructured big data services. The unstructured data description method of the present invention has the advantages of being comprehensive and efficient.

As shown in Figure 5, the unstructured data description device according to an embodiment of the invention comprises an attribute information acquisition module 100, an attribute information processing module 200 and a storage module 300. The attribute information acquisition module 100 collects and imports the attribute information of unstructured data manually or automatically. The attribute information processing module 200 is connected to the attribute information acquisition module 100; it generates, from the collected attribute information, a JSON file describing the attributes of the unstructured data file, and performs data modeling. The storage module 300 is connected to the attribute information processing module 200 and stores the unstructured data file together with the corresponding JSON file. Preferably, the device may further comprise a retrieval module 400 connected to the storage module 300; the retrieval module 400 interacts with the user, performs retrieval according to the retrieval mode selected by the user and the retrieval input defined by the user, and presents the retrieval result to the user for browsing.
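The cooperation of modules 100, 200 and 300 can be sketched as three functions. This is a minimal sketch under stated assumptions: the function names, the two automatically collected attributes (`name`, `size`) and the `.json` suffix for the descriptor file are all hypothetical, and the data modeling step of module 200 is omitted.

```python
import json
import os
import tempfile


def collect_attributes(path, manual=None):
    """Acquisition (module 100): automatic attributes merged with manual ones."""
    info = os.stat(path)
    attrs = {"name": os.path.basename(path), "size": info.st_size}
    attrs.update(manual or {})
    return attrs


def generate_descriptor(attrs):
    """Processing (module 200): serialize the collected attributes as JSON."""
    return json.dumps(attrs, ensure_ascii=False, sort_keys=True)


def store(path, descriptor):
    """Storage (module 300): keep the JSON descriptor next to the data file."""
    json_path = path + ".json"
    with open(json_path, "w", encoding="utf-8") as fh:
        fh.write(descriptor)
    return json_path


# Usage with a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "note.txt")
    with open(data_path, "w", encoding="utf-8") as fh:
        fh.write("hello")
    attrs = collect_attributes(data_path, manual={"subject": "demo"})
    saved = store(data_path, generate_descriptor(attrs))
    with open(saved, encoding="utf-8") as fh:
        loaded = json.load(fh)

print(loaded)
```

Here the manually supplied `subject` stands for attributes entered by a user, while `name` and `size` stand for attributes gathered automatically.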

In one embodiment of the invention, in the attribute information acquisition module 100, the attribute information comprises: a base attribute class, which comprises a file attribute group, a user attribute group and an authorization attribute group; a content attribute class, which comprises a description attribute group and a semantic attribute group; a property attribute class, which comprises a media attribute group, a document attribute group, an audio attribute group, a video attribute group and an image attribute group; a behavior attribute class, which comprises a file popularity attribute group, a task attribute group, a context attribute group and an interaction information attribute group; an environment attribute class, which comprises a topic popularity attribute group and a similar topic attribute group; and a time attribute class, which comprises a static time attribute group and a dynamic time attribute group.
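The six attribute classes and their attribute groups can be written down as a nested structure, which also yields an empty JSON-ready skeleton for a descriptor. The English identifiers below are assumed names chosen for this sketch; the invention does not prescribe key names.

```python
import json

# Attribute class -> attribute groups, following the enumeration above.
# All identifiers are illustrative assumptions.
ATTRIBUTE_CLASSES = {
    "base": ["file", "user", "authorization"],
    "content": ["description", "semantic"],
    "property": ["media", "document", "audio", "video", "image"],
    "behavior": ["file_popularity", "task", "context", "interaction"],
    "environment": ["topic_popularity", "similar_topic"],
    "time": ["static_time", "dynamic_time"],
}


def class_of(group):
    """Return the attribute class that contains the given attribute group."""
    for attribute_class, groups in ATTRIBUTE_CLASSES.items():
        if group in groups:
            return attribute_class
    raise KeyError(group)


def empty_descriptor():
    """Build a class -> group -> {} skeleton to be filled with elements."""
    return {attribute_class: {group: {} for group in groups}
            for attribute_class, groups in ATTRIBUTE_CLASSES.items()}


skeleton = empty_descriptor()
skeleton["base"]["file"]["name"] = "clip.mp4"  # one attribute element
print(json.dumps(skeleton["base"]))
```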

In one embodiment of the invention, each attribute group comprises one or more attribute elements.

In one embodiment of the invention, the attribute information processing module 200 performs tree structure data modeling, hexahedron structure data modeling and galactic structure data modeling on the attribute information.

In one embodiment of the invention, in the hexahedron structure data modeling, each face of the hexahedron represents one attribute class, and each attribute of the unstructured data is placed as a node on the face to which it belongs. Different attribute elements of the same attribute group are connected in a two-dimensional graph with weight 1; attribute elements in different attribute groups of the same attribute class are connected in a two-dimensional graph with weight 2; and attribute elements in different attribute classes are connected in a two-dimensional graph with weight 3.
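The weighting rule can be stated as a small function. In this sketch an attribute element is identified by an assumed (attribute class, attribute group, element name) triple; the triple representation and the name `edge_weight` are illustrative only.

```python
def edge_weight(a, b):
    """Weight of the connection between two attribute elements.

    a and b are (attribute_class, attribute_group, element) triples.
    """
    if a == b:
        raise ValueError("an element is not connected to itself")
    if a[0] != b[0]:
        return 3  # different attribute classes (different faces)
    if a[1] != b[1]:
        return 2  # same class, different attribute groups
    return 1      # same attribute group


name = ("base", "file", "name")
size = ("base", "file", "size")
owner = ("base", "user", "owner")
topic = ("content", "semantic", "topic")

print(edge_weight(name, size), edge_weight(name, owner), edge_weight(name, topic))
# prints 1 2 3
```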

In one embodiment of the invention, in the retrieval module 400, the retrieval modes available for the user to select comprise: basic retrieval mode, tree search mode, hexahedron structure retrieval mode and galactic structure retrieval mode.

In one embodiment of the invention, the basic retrieval mode is implemented on the basis of JAQL queries.

In summary, with regard to data model construction for big data services, and to address the lack of a general unstructured data model suitable for such services, this document proposes three logical-structure modeling methods (tree, regular hexahedron and galactic structure) and designs an unstructured data model that covers comprehensive data characteristics such as user behavior and semantic background. The example verification results show that models designed on the basis of the three logical structures have good generality and comprehensiveness, and that the lightweight JSON description together with JAQL querying provides a solid foundation for implementing unstructured big data services. The unstructured data description device of the present invention has the advantages of being comprehensive and efficient.

Any process or method described in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code that comprises one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order, depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.

In this specification, a description referring to the terms "an embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such schematic uses of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it should be understood that these embodiments are exemplary and are not to be construed as limiting the present invention. Those of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of the present invention without departing from its principle and purpose.

Claims

1. A method for describing unstructured data, comprising the following steps:

S1. collecting and importing the attribute information of an unstructured data file manually or automatically;

S2. generating, from the collected attribute information, a JSON file describing the attributes of the unstructured data file, and performing data modeling;

S3. storing the unstructured data file and the corresponding JSON file.

2. The method for describing unstructured data according to claim 1, characterized in that the attribute information comprises:

a base attribute class, the base attribute class comprising a file attribute group, a user attribute group and an authorization attribute group;

a content attribute class, the content attribute class comprising a description attribute group and a semantic attribute group;

a property attribute class, the property attribute class comprising a media attribute group, a document attribute group, an audio attribute group, a video attribute group and an image attribute group;

a behavior attribute class, the behavior attribute class comprising a file popularity attribute group, a task attribute group, a context attribute group and an interaction information attribute group;

an environment attribute class, the environment attribute class comprising a topic popularity attribute group and a similar topic attribute group; and

a time attribute class, the time attribute class comprising a static time attribute group and a dynamic time attribute group.

3. The method for describing unstructured data according to claim 1, characterized in that each attribute group comprises one or more attribute elements.

4. The method for describing unstructured data according to any one of claims 1-3, characterized in that the data modeling comprises: tree structure data modeling, hexahedron structure data modeling and galactic structure data modeling.

5. The method for describing unstructured data according to claim 4, characterized in that, in the hexahedron structure data modeling, each face of the hexahedron represents one attribute class, and each attribute of the unstructured data is placed as a node on the face to which it belongs; different attribute elements of the same attribute group are connected in a two-dimensional graph with weight 1, attribute elements in different attribute groups of the same attribute class are connected in a two-dimensional graph with weight 2, and attribute elements in different attribute classes are connected in a two-dimensional graph with weight 3.

6. A device for describing unstructured data, characterized by comprising:

an attribute information acquisition module, the attribute information acquisition module being used to collect and import the attribute information of unstructured data manually or automatically;

an attribute information processing module, the attribute information processing module being connected to the attribute information acquisition module and being used to generate, from the collected attribute information, a JSON file describing the attributes of the unstructured data file, and to perform data modeling;

a storage module, the storage module being connected to the attribute information processing module and being used to store the unstructured data file together with the corresponding JSON file.

7. The device for describing unstructured data according to claim 6, characterized in that, in the attribute information acquisition module, the attribute information comprises:

8. The device for describing unstructured data according to claim 7, characterized in that each attribute group comprises one or more attribute elements.

9. The device for describing unstructured data according to any one of claims 6-8, characterized in that the attribute information processing module performs tree structure data modeling, hexahedron structure data modeling and galactic structure data modeling on the attribute information.

10. The device for describing unstructured data according to claim 9, characterized in that, in the hexahedron structure data modeling, each face of the hexahedron represents one attribute class, and each attribute of the unstructured data is placed as a node on the face to which it belongs; different attribute elements of the same attribute group are connected in a two-dimensional graph with weight 1, attribute elements in different attribute groups of the same attribute class are connected in a two-dimensional graph with weight 2, and attribute elements in different attribute classes are connected in a two-dimensional graph with weight 3.

11. The device for describing unstructured data according to claim 7, characterized in that the description device is integrated with a retrieval module which, based on the results of the tree structure data modeling, hexahedron structure data modeling and galactic structure data modeling, provides the user with unstructured data retrieval in basic retrieval mode, tree search mode, hexahedron structure retrieval mode and galactic structure retrieval mode.

12. The device for describing unstructured data according to claim 11, characterized in that the basic retrieval mode is implemented on the basis of JAQL queries.