CN109086444B

CN109086444B - Data standardization method and device and electronic equipment

Info

Publication number: CN109086444B
Application number: CN201810940191.8A
Authority: CN
Inventors: 陈红梅; 王文剑
Original assignee: Jilin Yillion Bank Co ltd
Current assignee: Jilin Yillion Bank Co ltd
Priority date: 2018-08-17
Filing date: 2018-08-17
Publication date: 2020-12-29
Anticipated expiration: 2038-08-17
Also published as: CN109086444A

Abstract

The invention provides a data standardization method, a device and electronic equipment. After messages to be processed with different data structures are obtained, the format of the messages to be processed is first converted to obtain an intermediate message, and the intermediate message is then subjected to field analysis and standardization processing to obtain standardized data, so that messages to be processed with different data structures can be standardized into a data structure with a unified format.

Description

Data standardization method and device and electronic equipment

Technical Field

The invention relates to the field of data processing, in particular to a data standardization method and device and electronic equipment.

Background

With the increasing abundance of loan products on the internet, third-party credit investigation data is applied ever more widely and deeply in the financial field. Most internet financial institutions introduce credit investigation data from third-party institutions as a basis for loan examination and approval, for example the Sesame Credit score data provided by Sesame Credit.

However, the credit investigation data of different third-party organizations have different data structures, and no uniform data structure exists.

Disclosure of Invention

In view of the above, the present invention provides a data standardization method, apparatus and electronic device, so as to solve the problem that the credit investigation data of the third-party organization has different data structures and does not have a uniform data structure.

In order to solve the technical problems, the invention adopts the following technical scheme:

a method of data normalization, comprising:

acquiring a message to be processed and a data source of the message to be processed;

carrying out format conversion on the message to be processed, and converting the message to be processed into an intermediate message with a preset format;

and according to the field analysis rule of the message to be processed and the data source, carrying out field analysis and standardized processing on the intermediate message to obtain standardized data.

Preferably, according to the field analysis rule of the message to be processed and the data source, performing field analysis and standardization processing on the intermediate message to obtain standardized data, including:

according to the field analysis rule and the data source of the message to be processed, carrying out field analysis and configuration on the intermediate message to obtain the content of a preset identification field;

adding the preset identification field and the content of the preset identification field into the intermediate message to obtain a target message;

according to the preset identification field and the content of the preset identification field, carrying out field analysis on the target message, and analyzing to obtain a corresponding relation between a field path name and a field value;

and carrying out name standardization processing on the field path names in the corresponding relation to obtain standardized data.

Preferably, after the name standardization processing is performed on the field path name in the corresponding relationship to obtain standardized data, the method further includes:

storing the standardized data and setting the data validity period of the standardized data;

acquiring query time for querying the standardized data;

if the query time is within the data validity period, returning a query result comprising the standardized data;

and if the query time is not within the data validity period, returning a query result representing query failure.

Preferably, after the content of the preset identification field and the preset identification field is added to the intermediate message to obtain the target message, the method further includes:

acquiring a plurality of different messages to be integrated;

screening out at least one message to be integrated, the content of which is the same as that of at least one preset identification field in the target message;

and integrating the screened messages to be integrated with the target messages.

Preferably, after performing field analysis on the intermediate message according to the field analysis rule and the data source of the message to be processed to obtain the content of the preset identification field, the method further includes:

and identifying the content of an error field in the intermediate message, and setting the content of the error field as preset data.

A data normalization apparatus, comprising:

the acquisition module is used for acquiring a message to be processed and a data source of the message to be processed;

the format conversion module is used for carrying out format conversion on the message to be processed and converting the message to be processed into an intermediate message with a preset format;

and the data processing module is used for carrying out field analysis and standardized processing on the intermediate message according to the field analysis rule of the message to be processed and the data source to obtain standardized data.

Preferably, the data processing module includes:

the data processing submodule is used for carrying out field analysis and configuration on the intermediate message according to the field analysis rule and the data source of the message to be processed to obtain the content of a preset identification field;

the data adding submodule is used for adding the preset identification field and the content of the preset identification field into the intermediate message to obtain a target message;

the analysis submodule is used for carrying out field analysis on the target message according to the preset identification field and the content of the preset identification field, and analyzing to obtain the corresponding relation between the field path name and the field value;

and the standardization processing submodule is used for carrying out name standardization processing on the field path names in the corresponding relation to obtain standardized data.

Preferably, the apparatus further comprises:

the data setting module is used for storing the standardized data and setting the data validity period of the standardized data after the standardization processing submodule carries out name standardization processing on the field path names in the corresponding relation to obtain the standardized data;

the query acquisition module is used for acquiring query time for querying the standardized data;

the first result feedback module is used for returning a query result comprising the standardized data if the query time is within the data validity period;

and the second result feedback module is used for returning the query result representing the query failure if the query time is not within the data validity period.

Preferably, the apparatus further comprises:

the message acquisition module is used for acquiring a plurality of different messages to be integrated after the data adding submodule adds the preset identification field and the content of the preset identification field into the intermediate message to obtain the target message;

the message screening module is used for screening out a message to be integrated, the content of at least one field of which is the same as the content of at least one preset identification field in the target message;

and the message integration module is used for integrating the screened messages to be integrated with the target messages.

An electronic device, comprising: a memory and a processor;

wherein the memory is used for storing programs;

the processor calls a program and is used to:

Compared with the prior art, the invention has the following beneficial effects:

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flow chart of a method for data normalization according to an embodiment of the present invention;

FIG. 2 is a flow chart of another method for data normalization according to an embodiment of the present invention;

FIG. 3 is a flow chart of a method for data normalization according to another embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a data normalization apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

An embodiment of the present invention provides a data normalization method, and with reference to fig. 1, the data normalization method may include:

S11, acquiring a message to be processed and a data source of the message to be processed;

The message to be processed may be credit investigation data sent by a third-party organization, and the data source may be information such as the name or number of the third-party organization. For example, if the message to be processed is Sesame Credit data sent by Sesame Credit, the data source is Sesame Credit.

S12, carrying out format conversion on the message to be processed, and converting the message to be processed into an intermediate message with a preset format;

Specifically, the message to be processed is generally not in JSON (JavaScript Object Notation) format; for example, it may be in XML (Extensible Markup Language) format. In this step, messages to be processed in different formats are all converted into intermediate messages having the preset format, and the preset format may be the JSON format.
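By way of illustration only, the format conversion of step S12 may be sketched in Python as follows. The sketch assumes that the message to be processed is well-formed XML and that the preset format is JSON; the function names element_to_dict and to_intermediate_message are hypothetical and do not come from the embodiment, and XML attributes are ignored for brevity.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Recursively convert an XML element into plain Python data."""
    children = list(element)
    if not children:
        # Leaf node: keep the text content (empty string if absent).
        return (element.text or "").strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tags become a list so that no sibling is lost.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


def to_intermediate_message(raw_message):
    """Convert an XML message into a JSON string (the assumed preset format)."""
    root = ET.fromstring(raw_message)
    return json.dumps({root.tag: element_to_dict(root)}, ensure_ascii=False)


# Illustrative message; the tag names are assumptions.
xml_message = "<response><score>85</score><idcard>1101</idcard></response>"
intermediate = to_intermediate_message(xml_message)
```

A message that already arrives in JSON format could skip this step, so that every data source yields an intermediate message of the same format.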

And S13, according to the field analysis rule of the message to be processed and the data source, carrying out field analysis and standardization processing on the intermediate message to obtain standardized data.

The field analysis rule of the message to be processed may be an interface description document sent by a third-party organization for explaining the meaning of each field in the message to be processed.

The standardized data may be a relational data table including the contents of the fields in the message to be processed.

In this embodiment, after the to-be-processed messages with different data structures are obtained, the format of the to-be-processed messages is converted to obtain the intermediate message, and then the intermediate message is subjected to field analysis and standardized processing to obtain standardized data, so that the to-be-processed messages with different data structures can be subjected to standardized processing to obtain a data structure with a uniform format.

Alternatively, on the basis of the foregoing embodiment, referring to fig. 2, step S13 may include:

S21, according to the field analysis rule and the data source of the message to be processed, carrying out field analysis on the intermediate message to obtain the content of a preset identification field;

Specifically, the preset identification field may be a credit investigation channel number, a person name, an identity card number, a mobile phone number, a query time, an application order number, or the like.

The preset identification field can be a credit investigation identification index field, which comprises two kinds of identification fields: one kind represents the third-party organization, such as a data interface, a credit investigation channel number and an application order number; the other kind consists of common fields extracted in advance from the different messages to be processed, such as a person name, an identity card number and a mobile phone number.

The content of the identification fields representing the third-party organization can be configured according to the data source of the message to be processed; for example, the content of the third party's data interface is set to DNBBJHV, and the content of the credit investigation channel number is set to 1231212.

Specifically, the field analysis rule of the message to be processed makes known what each field in the message represents, so that the content of the common fields can be obtained. For example, if the common field is the identity card number, and idcard represents the identity card number in the message to be processed, the content of idcard is used as the content of the common field identity card number.

It should be noted that the field names of the common fields may be the same as or different from the field names of the corresponding fields in the message to be processed.
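As a non-limiting sketch, step S21 may be driven by a per-source configuration such as the following. The configuration layout, the key names (organization_fields, common_field_map), the common field names and the sample message are all hypothetical; only the example values DNBBJHV and 1231212 and the field name idcard are taken from the description.

```python
# Hypothetical per-source configuration, derived from the data source and
# from the field analysis rule (interface description document).
SOURCE_CONFIG = {
    "zmxy": {
        # Identification fields that describe the third-party organization.
        "organization_fields": {
            "data_interface": "DNBBJHV",
            "channel_no": "1231212",
        },
        # Common field name -> field name used in this source's message.
        "common_field_map": {
            "id_number": "idcard",
            "name": "name",
            "mobile": "phone",
        },
    }
}


def build_identification_fields(intermediate, data_source):
    """Return the preset identification fields and their contents."""
    config = SOURCE_CONFIG[data_source]
    fields = dict(config["organization_fields"])
    for common_name, source_name in config["common_field_map"].items():
        # The common field takes the content of the source-specific field.
        fields[common_name] = intermediate.get(source_name)
    return fields


# Illustrative intermediate message (assumed to be already parsed from JSON).
message = {
    "idcard": "110101199001011234",
    "name": "Zhang San",
    "phone": "13800000000",
    "score": 85,
}
identification = build_identification_fields(message, "zmxy")
```

Keeping the mapping in configuration rather than in code means that a new data source would only require a new configuration entry.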

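Optionally, on the basis of this embodiment, after step S21, the method may further include:

identifying the content of an error field in the intermediate message, and setting the content of the error field as preset data.

Specifically, the error field content may be field content of "NULL", which may be changed to 9999, so that the data can be recognized as error data when it is used later.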
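The replacement of error field content may be sketched as follows. This is an illustration under the assumption that only the exact string "NULL" marks an error and that the intermediate message has already been parsed into nested Python data; the function name replace_error_content is hypothetical.

```python
ERROR_PLACEHOLDER = 9999  # preset data used in the description's example


def replace_error_content(value):
    """Recursively replace "NULL" contents with the preset placeholder."""
    if isinstance(value, dict):
        return {key: replace_error_content(item) for key, item in value.items()}
    if isinstance(value, list):
        return [replace_error_content(item) for item in value]
    if value == "NULL":
        return ERROR_PLACEHOLDER
    return value


# Illustrative message; the field names are assumptions.
cleaned = replace_error_content(
    {"score": "NULL", "detail": {"level": "A", "tags": ["NULL", "x"]}}
)
```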

S22, adding the preset identification field and the content of the preset identification field into the intermediate message to obtain a target message;

Specifically, the preset identification field and the content of the preset identification field may be added to the very front of the content of the intermediate message.
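A minimal sketch of step S22 is given below, assuming that both the identification fields and the intermediate message are held as Python dictionaries, which preserve insertion order. The function name build_target_message and the rule for handling a name collision are assumptions that the description does not specify.

```python
def build_target_message(identification, intermediate):
    """Place the identification fields in front of the original content."""
    # Insertion order is preserved, so the identification fields come first.
    target = dict(identification)
    for key, value in intermediate.items():
        # Assumption: on a name collision the identification field is kept.
        target.setdefault(key, value)
    return target


# Illustrative values; the field names are assumptions.
target = build_target_message(
    {"channel_no": "1231212", "id_number": "1101"},
    {"idcard": "1101", "score": 85},
)
```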

S23, according to the preset identification field and the content of the preset identification field, carrying out field analysis on the target message, and analyzing to obtain the corresponding relation between the field path name and the field value;

Specifically, a real-time data stream processing technology is adopted: the target message is pushed to a Kafka server for real-time data standardization and structuring.

The Storm framework is used to first perform field analysis on the target message to obtain the correspondence between field path names and field values; then, for each field path name that represents a preset identification field, the last letter of the field path name is changed into the preset identification field, yielding the final correspondence between field path names and field values. The correspondence may be presented in the form of a field list, as shown in Table 1. The obtained correspondence can be stored in HBase.

Table 1 table of correspondence between field path name and field value

It should be noted that, for a field of an array type, an additional "field path_length" field may be added to record the array length. Meanwhile, to avoid an excessive number of fields caused by an overly long array, only the first 10 elements of the array are stored in the correspondence.
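The field analysis of step S23, including the array handling noted above, may be sketched as a recursive flattening. This sketch runs in plain Python rather than on Kafka and Storm. The underscore used to join path segments and the inclusion of the array index in the path are assumptions, since the contents of Table 1 are not reproduced here; the function name flatten is hypothetical.

```python
MAX_ARRAY_ITEMS = 10  # only the first 10 array elements are kept


def _join(path, part):
    """Join a parent path and a child segment with an underscore."""
    return f"{path}_{part}" if path else str(part)


def flatten(value, path=""):
    """Map a nested JSON value to {field path name: field value}."""
    flat = {}
    if isinstance(value, dict):
        for key, item in value.items():
            flat.update(flatten(item, _join(path, key)))
    elif isinstance(value, list):
        # Record the full array length, then keep only the first elements.
        flat[_join(path, "length")] = len(value)
        for index, item in enumerate(value[:MAX_ARRAY_ITEMS]):
            flat.update(flatten(item, _join(path, index)))
    else:
        flat[path] = value
    return flat


# Illustrative target message with a 12-element array.
target = {"response": {"score": 85, "loans": [{"amount": n} for n in range(12)]}}
pairs = flatten(target)
```

In this sketch the stored length still reports all 12 elements, while only the elements with indices 0 to 9 produce entries in the correspondence.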

And S24, carrying out name standardization processing on the field path names in the corresponding relation to obtain standardized data.

Specifically, Hive is used to map the data in HBase into a relational data table, and the field path names are standardized during the mapping process.

The method of normalizing the fields is as follows:

The external credit investigation interfaces are named uniformly in two levels: a product abbreviation and an interface abbreviation. For example, the application fraud scoring interface under Sesame Credit corresponds to the abbreviations "zmxy" and "sqqz".

Name standardization processing is then carried out on the field path names in each interface. The processing rule is "product abbreviation_interface abbreviation_" + standardized field path name.

In addition, the field storage types of all data sources can be unified.

For example, the following fields are standardized:

Table 2 Field standardization scheme

Original field name	Standardized field name
response_score	zmxy_sqqz_fraudScore
response_errorMessage	zmxy_sqqz_errorMessage
response_errorCode	zmxy_sqqz_errorCode
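The naming rule illustrated by Table 2 may be sketched as a lookup followed by prefixing. The mapping FIELD_NAME_MAP is inferred from the three rows of Table 2; the function name standardize_name and the fallback of keeping an unmapped path name unchanged are assumptions. In the embodiment this step is performed during the Hive mapping, which the sketch does not reproduce.

```python
# Original field path name -> standardized field path name (from Table 2).
FIELD_NAME_MAP = {
    "response_score": "fraudScore",
    "response_errorMessage": "errorMessage",
    "response_errorCode": "errorCode",
}


def standardize_name(product, interface, field_path):
    """Apply the rule: product abbreviation_interface abbreviation_name."""
    # Assumption: an unmapped field path name is kept unchanged.
    standardized = FIELD_NAME_MAP.get(field_path, field_path)
    return f"{product}_{interface}_{standardized}"


def standardize_record(product, interface, pairs):
    """Rename every field path name in a {path: value} correspondence."""
    return {
        standardize_name(product, interface, path): value
        for path, value in pairs.items()
    }


record = standardize_record(
    "zmxy", "sqqz", {"response_score": 85, "response_errorCode": "0"}
)
```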

First, this embodiment ensures that the purchased third-party credit investigation data is retained effectively and completely.

Second, a structuring method for third-party credit investigation JSON messages is provided, implemented on the basis of real-time data stream processing technology, so that data can be converted in real time from the semi-structured JSON form into a relational form, namely a two-dimensional table.

Third, standardized naming and standardized storage methods are provided for third-party credit investigation data from different sources, which reduces the cost of data analysis and modeling and improves the efficiency of big data utilization.

Optionally, on the basis of the previous embodiment, after the step S24, the method may further include:

S35, storing the standardized data and setting the data validity period of the standardized data;

Specifically, some data, such as names, identity card numbers and educational background, do not change within a short time, so repeated queries within a short time return the same result. The standardized data can therefore be put into a cache data table, with different validity periods set for different external credit investigation interfaces. For subsequent repeated queries of the same interface for the same client within the validity period, the external data source is no longer queried, which avoids repeated queries of the external credit investigation interfaces and saves query cost. The data validity period may be a data validity expiration date.

S36, acquiring query time for querying the standardized data;

S37, if the query time is within the data validity period, returning a query result comprising the standardized data;

and S38, if the query time is not within the data validity period, returning a query result representing the failure of query.

Specifically, the query of the normalized data is processed according to the following logic:

The cache database is queried first. If no corresponding credit investigation data exists for the client, a query result representing query failure is returned, and the external credit investigation service is queried directly.

If cached standardized data exists for the client:

if the query time is within the data validity period, returning a query result comprising the standardized data; and if the query time is not within the data validity period, returning a query result representing query failure.

In addition, the data validity period may be a data validity duration, for example the data is valid for one year. In this case a creation time field "createtime" is needed, and this field is added when the standardized data is put into the cache data table. The value of this field is added to the interface cache validity duration stored in the parameter table, and the sum is compared with the current time. If the sum is greater than or equal to the current time, the cached standardized data is returned. If the sum is less than the current time, the external credit investigation service is queried.
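The query logic of steps S36 to S38 may be sketched as follows, assuming that the cache data table is modeled as an in-memory dictionary keyed by client and interface and that the parameter table holds one validity duration per interface. The names VALIDITY_BY_INTERFACE and query_cache, the key layout and the sample values are hypothetical; only the field name createtime comes from the description.

```python
from datetime import datetime, timedelta

# Hypothetical parameter table: validity duration per external interface.
VALIDITY_BY_INTERFACE = {"zmxy_sqqz": timedelta(days=365)}


def query_cache(cache, client_id, interface, query_time):
    """Return cached standardized data, or None to signal a query failure."""
    record = cache.get((client_id, interface))
    if record is None:
        return None  # nothing cached: the caller queries the external service
    # Creation time plus validity duration gives the expiry moment.
    expires_at = record["createtime"] + VALIDITY_BY_INTERFACE[interface]
    if expires_at >= query_time:
        return record["data"]
    return None  # expired: the caller queries the external service


cache = {
    ("c001", "zmxy_sqqz"): {
        "createtime": datetime(2018, 1, 1),
        "data": {"zmxy_sqqz_fraudScore": 85},
    }
}
fresh = query_cache(cache, "c001", "zmxy_sqqz", datetime(2018, 6, 1))
stale = query_cache(cache, "c001", "zmxy_sqqz", datetime(2019, 6, 1))
```

Returning None in both failure cases lets the caller treat a missing record and an expired record in the same way, namely by querying the external credit investigation service.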

In this embodiment, different validity periods are set for different external credit investigation interfaces, and subsequent repeated queries of the same interface for the same client within the validity period no longer query the external data source, so repeated queries of the external credit investigation interfaces are avoided. This caching mechanism saves money for financial institutions that use third-party credit investigation data.

It should be noted that, for steps S31 to S34 of this embodiment, please refer to the corresponding descriptions in the above embodiments, which are not repeated here.

Optionally, on the basis of the embodiment corresponding to FIG. 2 or FIG. 3, after the preset identification field and the content of the preset identification field are added to the intermediate message to obtain the target message, the method may further include:

1) acquiring a plurality of different messages to be integrated;

the messages to be integrated are all target messages added with preset identification fields and the content of the preset identification fields.

2) Screening out at least one message to be integrated, the content of which is the same as that of at least one preset identification field in the target message;

3) and integrating the screened messages to be integrated with the target messages.

Specifically, the field contents of each preset identification field can be compared in turn, for example whether the person names are the same and whether the identity card numbers are the same. If the field content of at least one preset identification field is the same, the message to be integrated and the target message belong to the same user, and messages belonging to the same user may be integrated. For example, if the content of the target message is the educational background and the content of the message to be integrated is the growth experience, the two messages can be integrated into one message.
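The screening and integration described above may be sketched as follows. The identification field names, the sample messages, and the rule that the target message's own values are kept on a key collision are assumptions made for illustration; the function names same_user and integrate are hypothetical.

```python
# Hypothetical common identification fields used for matching.
IDENTIFICATION_FIELDS = ("name", "id_number", "mobile")


def same_user(target, candidate):
    """True if at least one identification field has the same content."""
    return any(
        target.get(field) is not None and target.get(field) == candidate.get(field)
        for field in IDENTIFICATION_FIELDS
    )


def integrate(target, candidates):
    """Merge every matching message to be integrated into the target message."""
    merged = dict(target)
    for candidate in candidates:
        if same_user(target, candidate):
            for key, value in candidate.items():
                # Assumption: existing values of the target message are kept.
                merged.setdefault(key, value)
    return merged


target = {"name": "Zhang San", "id_number": "1101", "education": "bachelor"}
candidates = [
    {"id_number": "1101", "experience": "x"},  # same user, is merged
    {"id_number": "2202", "experience": "y"},  # different user, is skipped
]
merged = integrate(target, candidates)
```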

In the embodiment, the data among different credit investigation institutions is effectively integrated by uniformly adding credit investigation identification index fields to heterogeneous messages from different credit investigation data sources, so that uniform analysis and modeling of the credit investigation data of each third party become possible.

Optionally, on the basis of the above embodiment of the data normalization method, another embodiment of the present invention provides a data normalization apparatus, and with reference to fig. 4, the data normalization apparatus may include:

an obtaining module 101, configured to obtain a message to be processed and a data source of the message to be processed;

a format conversion module 102, configured to perform format conversion on the message to be processed, and convert the message to be processed into an intermediate message with a preset format;

and the data processing module 103 is configured to perform field analysis and standardization processing on the intermediate message according to the field analysis rule of the message to be processed and the data source, so as to obtain standardized data.

It should be noted that, for the working process of each module in this embodiment, please refer to the corresponding description in the above embodiments, which is not described herein again.

Optionally, on the basis of the foregoing embodiment, the data processing module includes:

Further, the data processing module also includes:

the data correction submodule, used for identifying the content of an error field in the intermediate message and setting the content of the error field as preset data, after the data processing submodule carries out field analysis and configuration on the intermediate message according to the field analysis rule and the data source of the message to be processed to obtain the content of the preset identification field.

It should be noted that, for the working processes of each module and sub-module in this embodiment, please refer to the corresponding description in the above embodiments, which is not described herein again.

Optionally, on the basis of the above embodiment, the method further includes:

Optionally, on the basis of the above embodiment that includes the data adding sub-module, the method further includes:

Optionally, on the basis of the embodiments of the data normalization method and apparatus, another embodiment of the present invention provides an electronic device, which may include: a memory and a processor;

wherein the memory is used for storing programs;

the processor calls a program and is used to:

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of data normalization, comprising:

according to the field analysis rule of the message to be processed and the data source, carrying out field analysis and standardized processing on the intermediate message to obtain standardized data, which comprises: processing the intermediate message according to the field analysis rule and the data source of the message to be processed to obtain the corresponding relation between the field path name and the field value, and carrying out name standardization processing on the field path name in the corresponding relation to obtain standardized data.

2. The data standardization method of claim 1, wherein the step of processing the intermediate message according to the field analysis rule and the data source of the message to be processed to obtain the corresponding relation between the field path name and the field value comprises:

and according to the preset identification field and the content of the preset identification field, carrying out field analysis on the target message, and analyzing to obtain the corresponding relation between the field path name and the field value.

3. The data normalization method according to claim 1, wherein after the name normalization processing is performed on the field path names in the correspondence relationship to obtain normalized data, the method further comprises:

acquiring query time for querying the standardized data;

4. The data normalization method of claim 2, wherein the steps of adding the preset identification field and the content of the preset identification field to the intermediate message to obtain the target message further comprise:

acquiring a plurality of different messages to be integrated;

5. The data standardization method of claim 2, wherein after performing field analysis on the intermediate message according to the field analysis rule and the data source of the message to be processed to obtain the content of the preset identification field, the method further comprises:

6. A data normalization apparatus, comprising:

the data processing module is configured to perform field analysis and standardized processing on the intermediate message according to the field analysis rule of the message to be processed and the data source, so as to obtain standardized data, which includes: processing the intermediate message according to the field analysis rule and the data source of the message to be processed to obtain the corresponding relation between the field path name and the field value, and carrying out name standardization processing on the field path name in the corresponding relation to obtain standardized data.

7. The data normalization apparatus of claim 6, wherein the data processing module comprises:

8. The data normalization apparatus of claim 7, further comprising:

9. The data normalization apparatus of claim 7, further comprising:

10. An electronic device, comprising: a memory and a processor;

wherein the memory is used for storing programs;

the processor calls a program and is used to: