CN108228432A - A kind of distributed link tracking, analysis method and server, global scheduler
- Publication number: CN108228432A
- Application number: CN201611140248.3A
- Authority: CN (China)
- Prior art keywords: event, server, meta, detail, information
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
Abstract
A distributed link tracking and analysis method, a server, and a global scheduler are disclosed. While processing a user request, a server in a distributed cluster generates, alongside each detail event, a meta-event used for searching for that detail event; the meta-event contains a tracking identifier and the event generation time, and its data volume is smaller than that of the corresponding detail event. The server stores the generated detail events and meta-events separately in local storage. To analyze a request, the global scheduler sends a link query request to servers in the distributed cluster, receives the event information related to the tracking identifier that each server returns after a local search, and performs link analysis on the user request based on the received event information. By storing data locally and analyzing it on demand, the application eliminates the central repository and the large database, solving the cost and scalability problems of data storage, analysis, and network bandwidth.
Description
Technical Field
The present invention relates to computer technologies, and in particular, to a distributed link tracking and analyzing method, a server, and a global scheduler.
Background
Tracing is a technique by which a computer system records its own behavior, and is mostly used for debugging, monitoring, and tracking. It is similar to logging, but offers better performance, interferes less with the target system, and has wider coverage; it is often used to record higher-frequency, lower-level events.
In the field of tracing in computer science, an "event" (hereinafter referred to as a detail event) refers to an identifiable occurrence in which, when the processor reaches a specific code point and a condition set in advance at that point is satisfied, a series of operations is triggered. For example, a user sets a boolean variable in advance somewhere in the program code; when the processor reaches this point and detects that the boolean variable is true, an event is triggered and a series of pre-arranged operations is performed. These operations usually record the relevant information of the event in a log, such as the event name, the event occurrence time, the thread ID, and other data carried by the event.
"instrumentation," or "event," refers to the user's placement of a piece of logic (set by the user) somewhere in the code of an application (selected by the user), typically to record various information contained by the event. The dynamic record implantation is characterized in that code for tracing is not required to be added in program code in advance, and a special instruction set is added to a code position (such as the head and the tail of a function) designated by a user during the running of a target program by an operating system kernel or a privileged process, so that the aim of running the code logic set by the user is fulfilled. Static instrumentation is the user's modification of application source code to add code logic for tracing. The embedded program is called an embedded server program (embedded server).
Some events are not independent but related to each other. For example, a user application sends a data packet using the TCP protocol (event A), the TCP data segment is transmitted through the local network card (event B), and TCP detects that the data segment was lost and retransmits it (event C). If events containing timestamps and carrying useful information are recorded at the above key points during the whole retransmission process (A, B, C), and the causally related events are picked out of the total set of recorded events and put in order (for example, time order), a quantitative explanation can be given for the transmission delay of the user's data packet. This process of picking out and ordering events that are related to each other is called "event correlation".
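The ordering step can be illustrated with a small sketch; the event names, fields and timestamps below are illustrative assumptions, not values from this application.

```python
from dataclasses import dataclass

@dataclass
class Event:
    name: str         # e.g. "tcp_send" (A), "nic_tx" (B), "tcp_retransmit" (C)
    timestamp: float  # event occurrence time in seconds
    segment_id: int   # identifies the TCP segment the event refers to

def correlate(events, segment_id):
    """Pick out the events for one segment and order them by occurrence time."""
    related = [e for e in events if e.segment_id == segment_id]
    return sorted(related, key=lambda e: e.timestamp)

recorded = [
    Event("tcp_retransmit", 1.350, 7),   # event C
    Event("tcp_send",       1.100, 7),   # event A
    Event("nic_tx",         1.102, 7),   # event B
]
chain = correlate(recorded, segment_id=7)
delay = chain[-1].timestamp - chain[0].timestamp   # quantifies the added latency
```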
Contemporary internet services are typically implemented on large-scale distributed clusters, and a user request is likely to be dispatched to one or more clusters for processing. Each server in a cluster may play a different role during the processing of the request; together they complete the request and return the result to the user. In order to better understand the whole service/system, locate faults, and optimize performance, developers and administrators typically use a tracking technique to record information about the operations on each server and then aggregate and analyze the recorded information, such as the time and resources that a single user request spends in each processing stage of its life cycle. The process in which the servers generate and record this information while processing the user request is called "link tracking", and the subsequent process of analyzing the recorded data is called "link analysis".
There are two kinds of solutions in the industry for how events are correlated: black-box tracking and white-box tracking. The black-box solution has the advantage of portability: it does not modify application code and is completely transparent to the application. It either uses statistical methods on the collected communication packets to infer the association, and is therefore inaccurate, or it accurately infers the association from the messages exchanged between the collected function modules based on the causal relationships inherent in those messages. Its disadvantages are that it may be somewhat inaccurate and that inferring the association consumes more resources. Tracking systems that take the black-box approach include vPath, BorderPatrol, PreciseTracer, and the like. The white-box approach requires code instrumentation, which is its disadvantage; its advantages are that correlation inference is simple and event correlation is accurate. Tracking systems that adopt the white-box approach include X-Trace, Pinpoint, Magpie, Dapper, EagleEye, and the like.
A trace identifier (traceID) is a global identifier used in the simple event correlation method adopted by most white-box solutions. The tracking system generates a traceID for each request sent by a user; the traceID is associated with that request and is recorded and saved together with the event information whenever an event is triggered. All events associated with a request can therefore be chained together through the traceID.
User requests may be divided into type I user requests and type II user requests. A type II user request involves only a few machines in the cluster: a distributed-storage write request, for example, is typically written to three servers, while a distributed-storage read request needs to read from only one server. User requests that do not have this property are type I requests, such as search requests: a search request may involve nearly all servers in the cluster (if most of the servers hold documents containing the search key).
A distributed link tracking and analysis system needs to keep track of all the operations the system performs while serving a given user request. For example, FIG. 1 illustrates a distributed request service process involving five servers: front-end server A, middle-tier servers B and C, and back-end servers D and E. When a user initiates a request, the request first reaches front-end server A, which sends two Remote Procedure Call (RPC) messages to servers B and C. Server B responds immediately, while server C needs to interact with back-end servers D and E before returning a response to server A; finally, server A responds to the original user request. For such a request service process, the related-art implementation of distributed link tracking and analysis is that server A generates a traceID for the user request, and event information including the traceID and a timestamp is then recorded for each operation on each server. With the traceID, the events associated with the user request can be correlated in a subsequent analysis of the data.
Related-art implementations of distributed link tracking and analysis systems, such as Dapper by Google and EagleEye by Taobao, all employ a three-stage process. In the first stage, whenever a server reaches a pre-embedded code path point while serving the request, it records the event information related to the request and that code point, generally writing it to a local file. In the second stage, a daemon process or log-collecting program pulls the information data off each server and transmits it over the network to a central repository. In the third stage, data is fetched from the central repository for subsequent analysis, querying, correlation, and computation; at this stage a large database is additionally introduced between the central repository and the analytical computation, and databases such as MySQL, HBase, Infobright, and HiStore are used by link tracking systems in this third stage.
One problem with this approach is the high cost of data storage and network bandwidth. The raw information data on every server has to be transmitted over the network to the central repository for storage, so when the number of servers in the cluster is large and the amount of raw information data on a single server is also large, the storage and network bandwidth costs are very high (for example, on the order of terabytes per day), putting great pressure on the central repository. These network transmissions also impose some overhead on the production servers in the cluster. Scalability is also a problem: the central repository must grow with the number of servers in the cluster and the overall QPS (how many requests a server processes per second), so the central repository can become a bottleneck. Once massive amounts of data are stored in the central repository, a terabyte-scale database is needed to query the data efficiently, which consumes a large amount of CPU resources. Finally, users generally care most about the small fraction of anomalous requests and typically only inspect the link status of those anomalous requests, which makes this approach poorly cost-effective.
Disclosure of Invention
In view of this, the present invention provides the following solutions:
a distributed link tracking method, comprising:
in the process of processing a user request, a server in a distributed cluster generates, when generating a detail event, a meta-event used for searching for the detail event, wherein the meta-event comprises a tracking identifier and information of the event generation time, and its data volume is less than that of the corresponding detail event;
the server stores the generated detail event and meta-event separately in local storage.
A server in a distributed cluster, comprising a link tracking module, the link tracking module comprising:
an event generating unit configured to: generate, while generating a detail event in the processing of a user request, a meta-event used for searching for the detail event, the meta-event comprising a tracking identifier and information of the event generation time and having a data volume less than that of the corresponding detail event;
an event storage unit configured to: store the generated detail event and meta-event separately in local storage.
A server in a distributed cluster, comprising a processor and a memory, wherein:
the memory, configured to: saving the program code;
the processor is configured to: reading the program code and performing the following link trace processing:
in the process of processing a user request, generating, when a detail event is generated, a meta-event used for searching for the detail event, wherein the meta-event comprises a tracking identifier and information of the event generation time, and its data volume is less than that of the corresponding detail event;
and storing the generated detail event and meta-event separately in local storage.
According to the above distributed link tracking method and server, a meta-event used for searching for a detail event is generated when the detail event is generated, and both events are stored locally. During link analysis the detail event can therefore be found quickly on the local machine through the meta-event; the event information does not need to be uploaded to a data warehouse, and no large database needs to be built.
In view of this, the present invention also provides the following solutions:
a distributed link analysis method, comprising:
when a global scheduler in a distributed cluster needs to perform link analysis on a user request, sending a link query request to a server in the distributed cluster, wherein the link query request carries a tracking identifier of the user request;
the global scheduler receives event information which is returned by the server after local search and is related to the tracking identification;
and the global scheduler performs link analysis on the user request based on the received event information related to the tracking identification.
A global scheduler in a distributed cluster, comprising:
a link query module configured to: when link analysis needs to be performed on a user request, send a link query request carrying the tracking identifier of the user request to a server in the distributed cluster; and receive the event information related to the tracking identifier that the server returns after a local search;
a link analysis module configured to: and performing link analysis on the user request based on the received event information related to the tracking identification.
A global scheduler in a distributed cluster, comprising a processor and a memory, wherein:
the memory, configured to: saving the program code;
the processor is configured to: reading the program code and performing the following:
when link analysis needs to be performed on a user request, sending a link query request carrying the tracking identifier of the user request to a server in the distributed cluster;
receiving event information which is returned by the server after local search and is related to the tracking identification;
and performing link analysis on the user request based on the received event information related to the tracking identification.
The above distributed link analysis method and global scheduler store data locally and analyze it on demand, eliminating the central repository and the large database, and thereby solving the cost and scalability problems of data storage, analysis, and network bandwidth in traditional link tracking and analysis systems.
In view of this, the present invention also provides the following solutions:
a distributed link analysis method, comprising:
a server in a distributed cluster receives a link query request sent by a global scheduler, wherein the link query request carries a tracking identifier of a user request;
the server locally searches meta-events related to the tracking identification, determines a first search time window of detail events according to event generation time information contained in the searched meta-events, and locally searches the detail events related to the tracking identification according to the first search time window;
and the server returns the searched information of the detail event to the global scheduler.
A server in a distributed cluster, comprising a link analysis module, the link analysis module comprising:
a first query interface unit configured to: receiving a link query request sent by a global scheduler, wherein the link query request carries a tracking identifier of a user request; and returning the searched information of the detail event to the global scheduler;
a meta-event search unit configured to: locally searching for meta-events related to the tracking identity;
a detail event search unit configured to: and determining a first search time window for the detail event according to event generation time information contained in the meta event searched by the meta event searching unit, and locally searching the detail event related to the tracking identifier according to the first search time window.
A server in a distributed cluster, comprising a processor and a memory, wherein:
the memory, configured to: saving the program code;
the processor is configured to: reading the program code and performing the following:
receiving a link query request sent by a global scheduler, wherein the link query request carries a tracking identifier of a user request;
locally searching meta-events related to the tracking identification, determining a first searching time window of detail events according to event generation time information contained in the searched meta-events, and locally searching the detail events related to the tracking identification according to the first searching time window;
and returning the searched information of the detail event to the global scheduler.
According to the link analysis method and the server, the search time window of the detail event is obtained by searching the meta event, so that the search range of the detail event can be greatly reduced, and the link analysis process is accelerated.
Drawings
FIG. 1 is an exemplary diagram of the service process of a distributed request in a distributed cluster;
FIG. 2 is a flowchart of a distributed link tracking method according to an embodiment of the present invention;
FIG. 3 is a block diagram of a link tracking module in a server according to an embodiment of the present invention;
FIG. 4 is a flowchart of a distributed link analysis method on the global scheduler side according to an embodiment of the present invention;
FIG. 5 is a block diagram of a global scheduler according to an embodiment of the present invention;
FIG. 6 is a flowchart of the server-side distributed link analysis method according to Embodiment three of the present invention;
FIG. 7 is a unit structure diagram of a link analysis module in a server according to Embodiment three of the present invention;
FIG. 8 is an overall schematic diagram of the link analysis process for a type I request in Example two of the present invention;
FIG. 9 is a schematic diagram of the two-stage processing of a server's local event search in Example two of the present invention;
FIG. 10 is a schematic diagram of the first optimization of the first-stage processing in Example two of the present invention;
FIG. 11 is a schematic diagram of the second optimization of the first-stage processing in Example two of the present invention;
FIG. 12 is a schematic diagram of the second-stage processing in Example two of the present invention;
FIG. 13 is an overall schematic diagram of the link analysis process for a type II request in Example three of the present invention;
FIG. 14 is a schematic diagram of the two-stage processing and downstream tracing process in Example three of the present invention;
FIG. 15 is a schematic diagram of the downstream tracing process in Example three of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
For convenience of description, when referring to the contents included in event information and to the saving of event information, this application also refers to the event information recorded for a detail event as a "detail event" and to the event information recorded for a meta-event as a "meta-event".
In this document, in the process of the distributed cluster serving a user request, the server generating the tracking identifier for the user request is referred to as an origin server of the user request.
The embodiments of the present invention use local storage and on-demand computation to solve the cost and scalability problems of data storage, analysis, and network bandwidth in traditional distributed link tracking and analysis systems.
Example one
This embodiment relates to a white-box distributed link tracking and analysis system based on tracking identifiers, which comprises a global scheduler and the servers in a distributed cluster. The events generated while a server processes a user request are stored locally on that server; when link analysis is performed, the global scheduler collects the corresponding event information from the servers and performs the link analysis, so no data warehouse or large database needs to be built for link tracking and analysis.
The embodiment relates to a distributed link tracking method and a corresponding server, wherein the distributed link tracking mainly relates to generation and storage of events and is the basis of distributed link analysis.
The distributed link tracking method of the present embodiment is shown in fig. 2, and includes:
step 110, in the process of processing a user request, a server in the distributed cluster generates, when a detail event is generated, a meta-event used for searching for the detail event, wherein the meta-event comprises a tracking identifier and information of the event generation time, and its data volume is less than that of the corresponding detail event;
in order to distinguish the two types of events, the present application refers to "events" in the prior art tracking technology as "detail events", and refers to events generated along with the detail events as "meta events". The meta-event is used for searching the detail event and does not need to contain information of specific operation, so that the data volume of the meta-event is far smaller than that of the corresponding detail event, and the searching speed is higher than that of the detail event.
In this application, the server generates a detail event and a meta-event corresponding to that detail event, so the meta-event is generated along with the detail event. It should be noted that "along with" does not require the two to be generated at exactly the same moment; a small difference is allowed. For example, the program code can be designed so that the difference between the actual generation times of a detail event and its accompanying meta-event is always less than 10 ms, while event generation times are recorded only at a granularity of 100 ms; the two events then record the same time value despite the small difference in their actual generation times. When link analysis is performed, the event generation time of the meta-event can be used to determine the search time window for the corresponding detail event.
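A minimal sketch of this step is given below; the field names, the 100 ms rounding, and the helper for the processor identifier are assumptions used for illustration.

```python
import os
import time

def current_cpu_id():
    """Assumed stand-in for reading the identifier of the current processor."""
    return min(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else 0

def emit_events(trace_id, op_name, op_payload):
    now = time.time()
    detail_event = {                   # full operation record -> larger data volume
        "trace_id": trace_id,
        "time": now,
        "name": op_name,
        "cpu": current_cpu_id(),
        "payload": op_payload,
    }
    meta_event = {                     # only what is needed to find the detail event
        "trace_id": trace_id,
        "time": round(now, 1),         # recorded at 100 ms granularity
        "cpu": current_cpu_id(),
    }
    return detail_event, meta_event
```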
When link analysis is performed for a given user request, the server first needs to determine a search time window to narrow the search range, so that it does not have to search the whole storage directory for events; the start time of the search time window may be set to the time at which the tracking identifier of the user request was generated. In this embodiment, the source server encodes the generation time of the tracking identifier into the tracking identifier itself, and the tracking identifier is delivered to the servers during link analysis, so a server can very conveniently obtain the start time of the search time window from the tracking identifier. Of course, in other embodiments the tracking identifier need not contain its generation time; for example, the global scheduler may first query the source server for the generation time of the tracking identifier and then carry that information in the link query request sent to the servers. That is, the information about the generation time of the tracking identifier may be included in the tracking identifier carried in the link query request, or in another information unit carried in the link query request.
The duration of the search time window used in link analysis (end time minus start time) may be set to the maximum life cycle of a user request, for example 10 minutes; the call duration of any user request will not exceed this value. To narrow the search further, the duration of the search time window may instead be set to the call duration of the user request being analyzed. In this embodiment, the source server generates and stores an Application Programming Interface (API) event when the user request completes; the API event contains the tracking identifier and the call duration of the user request. The API event may be a special meta-event that accompanies a detail event, but is not limited to this.
In particular, the source server stores timeout API events in a dedicated directory for timeout API events, where a timeout API event is an API event whose user-request call duration exceeds the corresponding timeout duration. The timeout durations corresponding to different user requests may differ, but they are all smaller than the maximum life cycle the system sets for user requests. Typically, timeout API events are at least an order of magnitude fewer than API events, and link analysis is in most cases initiated for exceptional requests such as timed-out requests, so searching under the dedicated directory of timeout API events can greatly speed up the search for timed-out requests. When no timeout API event is found, the call duration of the user request did not time out, and the timeout duration corresponding to the user request can be returned to the global scheduler as the call duration of the user request.
In another embodiment, the API events (including the timeout API event) are all stored in the API event storage directory, and the source server may directly search the API events in the API event storage directory to obtain the call duration requested by the user. In another embodiment, other API events except the timeout API event are stored in an API event storage directory, the timeout API event is stored in a timeout API event specific directory (which may be a subdirectory of the API event storage directory or an independent directory), and when the timeout API event is not searched, a common API event is searched in the API event storage directory to obtain the call duration requested by the user. The API event storage directory may be a subdirectory under the meta event storage directory or may be independent of the meta event storage directory.
By storing the API events, the global scheduler may query the source server for the call duration information of the user request stored therein during the subsequent link analysis process, and may accurately determine the search time window for the meta-event.
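The following sketch shows one way the origin server could record an API event on request completion, with timeout API events going to their dedicated directory; the directory layout and field names are assumptions.

```python
import json
import os
import time

API_DIR = "trace/meta/api"                   # assumed ordinary API event directory
TIMEOUT_API_DIR = "trace/meta/api_timeout"   # assumed dedicated timeout directory

def record_api_event(trace_id, request_start_time, timeout_seconds):
    call_duration = time.time() - request_start_time
    event = {"trace_id": trace_id, "call_duration": call_duration}
    # timed-out requests go to their own directory to speed up later searches
    directory = TIMEOUT_API_DIR if call_duration > timeout_seconds else API_DIR
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "api_events.log"), "a") as f:
        f.write(json.dumps(event) + "\n")
```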
In order to enable the global scheduler to query the source server, the source server may, when generating the tracking identifier, encode its own address information into the tracking identifier. The address information of the server may be direct address information, such as the server's IP address, or indirect address information that can be used to find the server's IP address, such as a server ID. In other embodiments, the source server may instead report its address information to the global scheduler in some other manner, for example in a message.
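One possible encoding is sketched below; the application does not fix a concrete traceID layout, so the field widths and ordering here are assumptions.

```python
import os
import socket
import struct
import time

def generate_trace_id(origin_server_ip: str) -> bytes:
    """Pack the generation time (ms) and the origin server's IPv4 address into the traceID."""
    generated_ms = int(time.time() * 1000)
    ip_bytes = socket.inet_aton(origin_server_ip)
    seq = int.from_bytes(os.urandom(4), "big")     # disambiguates concurrent requests
    return struct.pack(">Q4sI", generated_ms, ip_bytes, seq)

def decode_trace_id(trace_id: bytes):
    """Recover the generation time (seconds) and the origin server's address."""
    generated_ms, ip_bytes, _seq = struct.unpack(">Q4sI", trace_id)
    return generated_ms / 1000.0, socket.inet_ntoa(ip_bytes)
```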
In this embodiment, if the user request is a distributed read request or a distributed write request, processing the request involves only a small number of servers in the distributed cluster, and the subsequent link analysis only needs to query that small number of servers. To identify them, a server generates a communication event when it performs the network communication that forwards the user request to a next-hop server; the communication event contains the tracking identifier and the address information of the next-hop server to which the user request is sent. During link analysis, the global scheduler may initiate a query only to the source server (for example, a front-end server); the source server then initiates queries to the next-hop servers that processed the user request according to the address information stored in its communication events, each server on the processing path does the same, and the link query is completed along the path on which the user request was processed.
In this embodiment, the meta-event further includes identification information of a processor that generates the meta-event, that is, only includes three kinds of information, namely, a tracking identification, an event generation time, and an identification of the processor that generates the meta-event. Or may include only tracking identification and event generation time information.
step 120, the server stores the generated detail event and meta-event separately in local storage.
In this embodiment, the server stores detail events and meta-events sequentially, in order of event generation time, in files under their respective directories, and rolls over to the next file once a file reaches a set maximum size. Because of this sequential storage, the time period into which the event generation times of the meta-events (or detail events) saved in a file fall can be determined from the file's creation time and/or last modification time.
In this embodiment, when storing detail events locally, the server stores the detail events generated by the same processor in the file group corresponding to that processor, with different processors corresponding to different file groups. During link analysis, the server can therefore determine the file group in which the corresponding detail events reside (namely the file group corresponding to the processor identified in the meta-event) from the processor identification information stored in the meta-event, further narrowing the search for detail events to that file group, which greatly reduces the search range and speeds up the search.
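A sketch of this storage scheme follows; the base directory, file-naming scheme, and size cap are assumptions.

```python
import json
import os

MAX_FILE_SIZE = 64 * 1024 * 1024   # assumed per-file cap before rolling over

def append_event(directory, event):
    """Append one event to the current file, rolling over when it is full."""
    os.makedirs(directory, exist_ok=True)
    files = sorted(f for f in os.listdir(directory) if f.endswith(".log"))
    current = os.path.join(directory, files[-1]) if files else None
    if current is None or os.path.getsize(current) >= MAX_FILE_SIZE:
        current = os.path.join(directory, "events-%06d.log" % len(files))
    with open(current, "a") as f:   # events arrive (and are written) in time order
        f.write(json.dumps(event) + "\n")

def store_locally(detail_event, meta_event, base_dir="trace"):
    append_event(os.path.join(base_dir, "meta"), meta_event)
    # detail events: one file group (here, one subdirectory) per processor
    cpu_group = "cpu-%d" % detail_event["cpu"]
    append_event(os.path.join(base_dir, "detail", cpu_group), detail_event)
```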
This embodiment also provides a server in a distributed cluster, including a link tracking module, as shown in fig. 3, where the link tracking module includes:
an event generating unit 10 configured to: generating a meta event for searching a detail event while generating the detail event in processing a user request; the meta-event comprises a tracking identifier and information of an event generation time, and the data volume is less than that of the corresponding detail event;
an event storage unit 20 configured to: store the generated detail event and meta-event separately in local storage.
In this embodiment, the link tracking module may further include: an identification generation unit configured to: generating a tracking identifier for a user request, and encoding information of the generation moment of the tracking identifier into the tracking identifier, or encoding the information of the generation moment of the tracking identifier and the address information of the server into the tracking identifier.
In this embodiment, the event storage unit storing the detail events and meta-events separately in local storage may include: sequentially storing the detail events and meta-events in files under their respective directories in order of event generation time, and rolling over to the next file for storage after a file reaches the set maximum size.
In this embodiment, the meta-event generated by the event generating unit may include identification information of a processor that generates the meta-event; the event storage unit stores the detail event locally, and may include: and storing the detail events generated by the same processor in a file group corresponding to the processor, wherein different processors correspond to different file groups.
In this embodiment, the event generating unit may be further configured to: when the user request is completed, generating an Application Programming Interface (API) event, wherein the API event comprises the tracking identifier and the calling duration information of the user request; the event storage unit is further configured to store the API event.
In this embodiment, the event storage unit storing the API event locally may include: storing the timeout API events among the API events in a dedicated directory for timeout API events, where a timeout API event is an API event whose user-request call duration exceeds the timeout duration.
In this embodiment, the event generating unit may be further configured to: generating a communication event when network communication processing is carried out, wherein the network communication processing is used for sending the user request to a next hop server, and the communication event comprises the tracking identifier and address information of the next hop server to which the user request is sent; the user request is a user request for a distributed read or a distributed write.
The functions executed by each unit of the link tracking module in the server in this embodiment may also refer to the detailed description in the link tracking method.
The present embodiment also provides a server in a distributed cluster, comprising a processor and a memory, wherein,
the memory, configured to: saving the program code;
the processor is configured to: reading the program code and performing the following link trace processing:
in the process of processing a user request, generating a meta-event for searching a detail event when the detail event is generated, wherein the meta-event comprises a tracking identifier and information of an event generation time, and the data volume is less than that of the corresponding detail event;
and storing the generated detail event and meta event separately in a local place.
In this embodiment, the link tracking processing executed by the processor may include all the processing in the link tracking method of this embodiment, and a description thereof is not repeated here.
According to the distributed link tracking method and server of this embodiment, a meta-event used for searching for a detail event is generated when the detail event is generated, and both events are stored locally, so that during link analysis the detail event can be found quickly on the local machine through the meta-event; the event information does not need to be uploaded to a data warehouse, and no large database needs to be built.
Example two
In the first embodiment, a link tracking method for servers in a distributed cluster was described: the servers store the generated events locally without uploading them to a data warehouse. In the distributed link analysis method of this embodiment, the global scheduler queries the servers for the link, obtains the relevant event information, and then aggregates and analyzes it.
The distributed link analysis method of the present embodiment is used for a global scheduler, and as shown in fig. 4, the distributed link analysis method includes:
step 210, when a global scheduler in a distributed cluster needs to perform link analysis on a user request, sending a link query request to a server in the distributed cluster, wherein the link query request carries a tracking identifier of the user request;
in this embodiment, the global scheduler performs link analysis on the user request, which may be triggered according to a user instruction or may be automatically triggered according to a configured condition, which is not limited to this.
As described above, when searching for events related to a user request, a server determines a search time window, whose start time may be set to the generation time of the tracking identifier of the user request and whose duration may be set to the maximum life cycle of the user request or to the call duration of the user request. The generation time of the tracking identifier may be included in the tracking identifier. If the duration is set to the maximum life cycle of the user request, the maximum life cycle information may be carried in the link query request; of course, it may instead be configuration information on the server, in which case it need not be carried in the link query request. If the duration is set to the call duration of the user request, the global scheduler may query the source server for the call duration of the user request and carry that information in the link query request.
In this embodiment, when the types of the user requests are different, the global scheduler adopts different link query modes: if the user request is a search request, the global scheduler sends the link query request to all servers in the distributed cluster; and if the user request is a distributed read request or a distributed write request, the global scheduler sends the link query request to an origin server of the user request.
A distributed cluster may provide different services; for example, some provide read/write services and some provide search services. A cluster that provides only one type of service may use a default link query mode. For a distributed cluster that provides multiple types of services at the same time, the global scheduler may choose the link query mode according to the type of the user request, or according to a flag bit indicating the processing mode. For example, one bit of the tracking identifier is used as a flag bit: when the flag bit is "0", the link query request should be sent to all servers in the distributed cluster; when the flag bit is "1", the link query request only needs to be sent to the source server in the distributed cluster.
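The scheduler-side choice can be sketched as below; the flag-bit convention and the RPC helper are assumptions, and the transport is left as a stub.

```python
def send_link_query(server, trace_id):
    """Stub for the RPC that asks `server` for event info related to `trace_id`."""
    raise NotImplementedError("transport-specific")

def dispatch_link_query(trace_id, flag_bit, origin_server, all_servers):
    """flag_bit 1: type II request, query only the origin server; 0: query all servers."""
    targets = [origin_server] if flag_bit == 1 else list(all_servers)
    event_info = []
    for server in targets:
        event_info.extend(send_link_query(server, trace_id))
    return event_info
```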
Step 220, the global scheduler receives event information related to the tracking identifier returned by the server after local search;
step 230, the global scheduler performs link analysis on the user request based on the received event information related to the tracking identifier.
In the distributed link analysis method of this embodiment, each user request has a unique tracking identifier, so the event information related to the tracking identifier is exactly the event information generated while that user request was processed. This embodiment mainly concerns the process of acquiring the event information and does not limit the analysis processing performed after the event information is received.
This embodiment further provides a global scheduler in a distributed cluster, as shown in fig. 5, including:
a link query module 30 configured to: when link analysis needs to be performed on a user request, send a link query request carrying the tracking identifier of the user request to a server in the distributed cluster, and receive the event information related to the tracking identifier that the server returns after a local search;
a link analysis module 40 configured to: and performing link analysis on the user request based on the received event information related to the tracking identification.
In this embodiment, the link query request sent by the link query module may also carry the maximum life cycle information requested by the user, and the tracking identifier includes information of the time when the tracking identifier is generated; or the link query request sent by the link query module may also carry the call duration information of the user request obtained by the global scheduler querying the source server, and the tracking identifier includes information of the generation time of the tracking identifier.
In this embodiment:
the user request is a search request, and the link query module sending a link query request to a server in the distributed cluster includes: sending the link query request to all servers in the distributed cluster; or
the user request is a distributed read request or a distributed write request, and the link query module sending a link query request to a server in the distributed cluster includes: sending the link query request to the source server of the user request in the distributed cluster.
The present embodiment further provides a global scheduler in a distributed cluster, including a processor and a memory, where:
the memory, configured to: saving the program code;
the processor is configured to: reading the program code and performing the following link analysis processing:
when link analysis needs to be performed on a user request, sending a link query request carrying the tracking identifier of the user request to a server in the distributed cluster;
receiving event information which is returned by the server after local search and is related to the tracking identification;
and performing link analysis on the user request based on the received event information related to the tracking identification.
In this embodiment, the link analysis processing executed by the processor in the global scheduler may include all the processing in the link analysis method in this embodiment, and a description thereof is not repeated here.
The distributed link analysis method and global scheduler of this embodiment store data locally and analyze it on demand, eliminating the central repository and the large database, and thereby solving the cost and scalability problems of data storage, analysis, and network bandwidth in traditional link tracking and analysis systems.
Example three
This embodiment relates to a link analysis method executed by a server in a distributed cluster: after receiving a link query request sent by the global scheduler, the server searches locally for the related events according to the tracking identifier and returns the found event information to the global scheduler. To speed up the search, the search range can be narrowed in two ways: through the search time window determined from the event generation time information in the meta-events, and through the file group determined from the processor identification information in the meta-events.
As shown in fig. 6, the distributed link analysis method of this embodiment includes:
step 310, a server in a distributed cluster receives a link query request sent by a global scheduler, wherein the link query request carries a tracking identifier of a user request;
step 320, the server locally searches the meta-event related to the tracking identifier, determines a first search time window for a detail event according to event generation time information contained in the searched meta-event, and locally searches the detail event related to the tracking identifier according to the first search time window;
in this embodiment, the server locally searching for meta-events related to the tracking identifier includes: searching, under the meta-event storage directory, for the second target files in which meta-events whose event generation time falls within a second search time window are located, and searching those second target files for meta-events related to the tracking identifier. The start time of the second search time window is the generation time of the tracking identifier, and its duration is the call duration of the user request or the maximum life cycle of the user request; the generation time of the tracking identifier is carried in the link query request, and may specifically be included in the tracking identifier carried in the link query request, or in another information unit of the link query request distinct from the tracking identifier.
The server searching for the second target files includes: determining, from the creation time and/or last modification time of each file in the meta-event storage directory, the time period into which the event generation times of the meta-events saved in that file fall; and searching for the files whose time period falls within the second search time window, the files found being the second target files. The meta-events are stored in the files sequentially in order of event generation time, and writing rolls over to the next file once a file reaches a set maximum size. In one example, the time period is determined from last modification times: if a file's last modification time is t4 and the immediately preceding file's last modification time is t3, the time period of the event generation times of the meta-events saved in that file is determined to be [t3, t4]. In another example, if a file's creation time is t5 and its last modification time is t6, the time period may be determined to be [t5, t6]. In yet another example, if a file's creation time is t7 and the next file's creation time is t8, the time period is determined to be [t7, t8].
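A sketch of this file-level pruning is given below; the on-disk format (one JSON event per line) is an assumption, and st_ctime is used as an approximation of the file creation time.

```python
import json
import os

def find_target_files(directory, window_start, window_end):
    """Files whose covered time period overlaps the search window [window_start, window_end]."""
    targets = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        st = os.stat(path)
        first_event_time = st.st_ctime      # ~ creation time of the file
        last_event_time = st.st_mtime       # ~ last modification time of the file
        if first_event_time <= window_end and last_event_time >= window_start:
            targets.append(path)
    return targets

def search_directory(directory, trace_id, window_start, window_end):
    """Scan only the target files and keep events carrying the tracking identifier."""
    hits = []
    for path in find_target_files(directory, window_start, window_end):
        with open(path) as f:
            for line in f:
                event = json.loads(line)
                if event.get("trace_id") == trace_id:
                    hits.append(event)
    return hits
```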
If the call duration of the user request is used as the duration of the second search time window, the global scheduler may query the source server for the call duration of the user request and carry that information in the link query request it sends.
In the query process of the call duration, the processing executed by the source server comprises the following steps:
the source server of the user request receives a calling duration query request of the global scheduler, wherein the calling duration query request carries the tracking identifier;
in this embodiment, the source server locally searches for a timeout API event related to the tracking identifier, obtains the call duration from the timeout API event found, and returns it to the global scheduler. A timeout API event is an API event whose user-request call duration exceeds the corresponding timeout duration; an API event contains the tracking identifier and the call duration information of the user request.
In particular, if the source server does not find a timeout API event related to the tracking identifier locally, it may continue the search in the storage directory of the other API events, or return the timeout duration corresponding to the user request to the global scheduler as the call duration. Using the timeout duration as the call duration saves the time of continuing to search for API events, and the call duration returned to the global scheduler is still smaller than the maximum life cycle of the user request, so the first search time window is smaller.
In another embodiment, the source server stores the timeout API event and other API events in the same API event storage directory, searches for an API event related to the tracking identifier in the API event storage directory after receiving a call duration query request, and obtains the call duration information from the searched API event and returns the call duration information to the global scheduler.
In particular, the search for API events or timeout API events may be accelerated as follows: search for the third target files in which API events or timeout API events whose event generation time falls within a third search time window are located, and search those third target files for the API event or timeout API event related to the tracking identifier, where the start time of the third search time window is the generation time of the tracking identifier contained in the tracking identifier, and its duration is the maximum life cycle of a user request. Searching for the third target files is similar to searching for the first target files: determine, from the creation time and/or last modification time of each file in the API event storage directory or the timeout API event storage directory, the time period of the event generation times of the API events saved in that file, and then search for the files whose time period falls within the third search time window; the files found are the third target files.
In this embodiment, the server determining a first search time window for detail events and locally searching for detail events related to the tracking identifier within the first search time window includes:
the server determines [t1, t2] as the first search time window, where t1 and t2 are respectively the earliest and latest of the event generation times contained in the meta-events found;
the server searches the detail event storage directory for the first target files in which detail events whose event generation time falls within the first search time window are located, and then searches those first target files for detail events related to the tracking identifier.
During link tracking, the server stores detail events in files sequentially in order of event generation time and rolls over to the next file once a file reaches the set maximum size. The server can therefore determine, from the creation time and/or last modification time of each file in the detail event storage directory, the time period of the event generation times of the detail events saved in that file, and then search for the files whose time period falls within the first search time window; the files found are the first target files.
In this embodiment, the search is also greatly accelerated by assigning different file groups to different processors. During link tracking, each generated meta-event includes the identification information of the processor that generated it, and the detail events generated by the same processor are stored in the file group corresponding to that processor, with different processors corresponding to different file groups. During link analysis, the server limits the search range to the file groups corresponding to those processors in either of the following ways:
the server determines the file group corresponding to a processor from the processor identification information contained in the meta-events found, searches for the first target files within that file group, and then searches those first target files for detail events related to the tracking identifier; or
the server determines the file group corresponding to a processor from the processor identification information contained in the meta-events found, and after searching for the first target files, selects from them the files belonging to that file group and then searches the selected files for detail events related to the tracking identifier.
Some servers have hundreds of processors, of which only a few may appear in the meta-event records, so this can narrow the search range to less than one tenth of the detail-event files. The resulting increase in search speed is significant.
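Continuing the sketch above (and reusing its search_directory helper), the first search time window and the per-processor file groups can be combined as follows; the directory layout matches the storage sketch given earlier and remains an assumption.

```python
import os

def search_detail_events(base_dir, trace_id, meta_events):
    """Search only the file groups of the processors named in the meta-events,
    within the first search time window [t1, t2]."""
    t1 = min(e["time"] for e in meta_events)   # earliest meta-event generation time
    t2 = max(e["time"] for e in meta_events)   # latest meta-event generation time
    hits = []
    for cpu in {e["cpu"] for e in meta_events}:
        group_dir = os.path.join(base_dir, "detail", "cpu-%d" % cpu)
        if os.path.isdir(group_dir):
            hits.extend(search_directory(group_dir, trace_id, t1, t2))
    return hits
```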
In order to avoid missing detail events, after the server searches for the detail events related to the tracking identifier, it may further count the number M1 of detail events found that are related to the tracking identifier; if M1 is smaller than the number M2 of meta-events related to the tracking identifier found locally, the server searches the other file groups in the detail event storage directory for detail events related to the tracking identifier.
In step 330, the server returns the searched information of the detail event to the global scheduler.
If the link query request is a link query request for a search request, the global scheduler may directly send the link query request to all servers in the distributed cluster, and the servers return query results to the global scheduler after local query.
If the link query request is a link query request for a distributed read request or a distributed write request of a user, the server further needs to perform the following processing:
the server inquires a communication event which is locally stored and is related to the tracking identification, wherein the communication event comprises address information of a next hop server to which the user request is sent;
if the server inquires the communication event, a link inquiry request is sent to the next hop server according to the address information in the communication event, and the link inquiry request carries the tracking identifier of the user request;
and the server receives the information of the detail event related to the tracking identification returned by the next-hop server and returns the information to the global scheduler.
This embodiment also provides a server in a distributed cluster, including a link analysis module, as shown in fig. 7, where the link analysis module includes:
a first query interface unit 50 arranged to: receiving a link query request sent by a global scheduler, wherein the link query request carries a tracking identifier of a user request; and returning the searched information of the detail event to the global scheduler;
a meta-event search unit 60 configured to: locally searching for meta-events related to the tracking identity;
a detail event search unit 70 configured to: and determining a first search time window for the detail event according to event generation time information contained in the meta event searched by the meta event searching unit, and locally searching the detail event related to the tracking identifier according to the first search time window.
In this embodiment, the locally searching for the meta-event related to the tracking identifier by the meta-event searching unit may include: searching a second target file where the meta-event with the event generating time falling into a second searching time window is located under the storage directory of the meta-event, and searching the meta-event related to the tracking identification in the second target file; the starting time of the second search time window is the generation time of the tracking identifier, the duration is the calling duration of the user request or the maximum life cycle of the user request, and the information of the generation time of the tracking identifier is carried in the link query request.
In this embodiment, the searching for the second target file by the meta-event searching unit may include: determining the time period of the event generation time of the meta-event stored in the file according to the creation time and/or the last modification time of the file in the meta-event storage directory; searching the file with the time period falling into the second searching time window, namely the second target file; and the meta-events are sequentially stored in the files according to the sequence of the event generation time, and when one file reaches a set maximum size, the next file is transferred for storage.
In this embodiment, the link analysis module may further include:
a call duration storage unit configured to: storing an Application Programming Interface (API) event or an overtime API event, wherein the API event comprises a tracking identifier of a user request and information of calling duration, and the overtime API event refers to an API event that the calling duration of the user request exceeds the corresponding overtime duration;
a call duration search unit configured to: receiving a calling duration query request sent by the global scheduler, locally searching an API event or an overtime API event related to a tracking identifier carried in the calling duration query request, acquiring calling duration information requested by a user from the searched API event or overtime API event, and returning the calling duration information to the global scheduler.
In this embodiment, after locally searching for the timeout API event related to the tracking identifier, the calling duration searching unit may be further configured to: if no such timeout API event is found, return the information of the timeout duration corresponding to the user request to the global scheduler as the information of the calling duration.
In this embodiment, the step of locally searching for the API event or the timeout API event related to the tracking identifier by the call duration searching unit may include: searching a third target file where an API event or an overtime API event of which the event generating moment falls into a third searching time window is located, and searching the API event or the overtime API event related to the tracking identifier in the third target file; and the starting time of the third search time window is the generation time of the tracking identifier, and the duration is the maximum life cycle requested by the user.
In this embodiment, the detail event searching unit may include:
a time window subunit configured to: determining [ t1, t2] as the first search time window, wherein t1, t2 are the earliest time and the latest time, respectively, of event generation times included in the searched meta-event;
a search subunit configured to: search, under a detail event storage directory, for a first target file in which detail events whose event generation time falls into the first search time window are located, and search the first target file for the detail events related to the tracking identifier.
In this embodiment, the searching the first target file by the searching subunit under the detail event storage directory may include: determining a time period of an event generation moment of a detail event stored in a file according to the creation time and/or the last modification time of the file in the detail event storage directory, and searching the file of which the time period falls into the first search time window, namely the first target file; and the detail events are sequentially stored in the files according to the sequence of the event generation time, and when one file reaches a set maximum size, the next file is transferred for storage.
In this embodiment, the meta-event includes identification information of a processor that generates the meta-event, the detail event generated by the same processor is stored in a file group corresponding to the processor, and different processors correspond to different file groups; the searching subunit searches the first target file under the detail event storage directory, and then searches the first target file for the detail event related to the tracking identifier, which may include:
determining a file group corresponding to the processor according to identifier information of the processor contained in the searched meta-event, searching the first target file in the corresponding file group, and searching the detail event related to the tracking identifier in the first target file; or
Determining a file group corresponding to the processor according to identifier information of the processor contained in the searched meta-event, selecting files belonging to the corresponding file group from the first target files after searching the first target files, and searching the selected files for detail events related to the tracking identifier.
In this embodiment, the searching subunit may further be configured to: after searching the detail event related to the tracking identification, counting the number M1 of the searched detail event related to the tracking identification, if M1 is smaller than the number M2 of the meta event related to the tracking identification searched by the meta event searching unit, then searching the detail event related to the tracking identification in other file groups in the detail event storage directory.
In this embodiment, the link query request is a link query request of a distributed read request or a distributed write request of a user; the link analysis module may further include a second query interface unit, the second query interface unit including:
an address acquisition subunit configured to: after receiving the link query request, locally querying a communication event related to the tracking identifier, and if the communication event is queried, acquiring address information of a next hop server to which the user request is sent;
a link query subunit configured to: sending a link query request to the next hop server according to the address information of the next hop server, wherein the link query request carries a tracking identifier of the user request;
an information transfer subunit configured to: and receiving the information of the detail event related to the tracking identification returned by the next hop server and returning the information to the global scheduler.
The present embodiment also provides a server in a distributed cluster, comprising a processor and a memory, wherein,
the memory, configured to: saving the program code;
the processor is configured to: reading the program code and performing the following:
receiving a link query request sent by a global scheduler, wherein the link query request carries a tracking identifier of a user request;
locally searching meta-events related to the tracking identification, determining a first searching time window of detail events according to event generation time information contained in the searched meta-events, and locally searching the detail events related to the tracking identification according to the first searching time window;
and returning the searched information of the detail event to the global scheduler.
In this embodiment, the link analysis process executed by the processor in the server may include all processes in the link analysis method of this embodiment, and a description thereof is not repeated here.
The link analysis method and the server of the embodiment obtain the search time window of the detail event by searching the meta event, and provide a plurality of technical means for accelerating meta event search and accelerating detail event search, so that the search range of the detail event can be greatly reduced, and the link analysis process is accelerated.
The distributed link tracking and analysis method adopted by the embodiment dispenses with a central repository and a large database; the raw event data always stays on the local server. Only when a user actually needs to inspect the link of a particular user request are the events related to that request queried and analyzed. Therefore, most of the time, most of the data for a user request, such as detail events, is only stored on the server and never takes part in network transmission. In addition, by exploiting the fact that events are time-ordered, the above embodiments introduce several optimization schemes, so that link analysis for one request can be completed at the second level even under a per-server load of 10000 QPS (that is, each server processes 10000 requests per second), and the time does not grow with the scale of the target cluster, making the scheme suitable for online query and analysis. Scalability is not an issue, from either a storage or a computing perspective.
The invention is further illustrated by means of a few examples of practical applications.
Example 1
This example relates to link tracing, with a primary focus on the generation and preservation of events.
The distributed link trace analysis system of the present example includes a global scheduler (which may be distributed across 1 or several machines) and servers (referred to herein as production servers that provide services to users).
The present example relates to a link tracking method common to class I requests (e.g., search requests) and class II requests (e.g., distributed read requests and distributed write requests).
In this example, the server generates a small meta-event each time it generates a detail event. The meta-event contains only a tracking identifier (traceID), event generation time information, and an identifier of the processor that generated the event, such as a CPU number. A meta-event occupies only about 20 bytes, a small fraction of the size of a detail event. Although meta-events are generated along with detail events, the two types of events are stored in two separate local directories.
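A possible in-memory representation of such a meta-event is sketched below. The field names and the packed binary layout are assumptions chosen so that one record comes out at 20 bytes, matching the size mentioned above; the patent itself only specifies the three pieces of information a meta-event carries.

```python
import struct
from dataclasses import dataclass

@dataclass
class MetaEvent:
    """Compact companion record written alongside every detail event."""
    trace_id: int       # tracking identifier (traceID), here a 64-bit value
    event_time_us: int  # event generation time, microseconds since the epoch
    cpu_number: int     # CPU that generated the accompanying detail event

    def pack(self) -> bytes:
        # 8 + 8 + 4 bytes = 20 bytes per meta-event.
        return struct.pack("<QQI", self.trace_id, self.event_time_us, self.cpu_number)

    @classmethod
    def unpack(cls, raw: bytes) -> "MetaEvent":
        return cls(*struct.unpack("<QQI", raw))
```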
In this example, when events (meta event, detail event, communication event, and API event) are saved in a local file during the link tracking process, the following rules are followed:
a) the detail event generated by the same processor is stored in a file group corresponding to the processor, and different processors correspond to different file groups.
Each processor in the server corresponds to a file group. In this example, the CPU number of each CPU is written directly into the names of the files that CPU generates, so the file group corresponding to a CPU can be determined very conveniently; however, the invention may also bind file groups to processors in other ways, such as maintaining a correspondence table between the processor identifier and a file attribute (including but not limited to the file name). When the server stores an event, it is written into a file of the file group corresponding to the CPU that generated the event. The maximum size of each file may be fixed, e.g. set to 2 MB; when the file size exceeds this limit, writing moves to the next file of the same file group. For example, files may be generated in a round-robin (rotating) manner: each file name carries a sequence number that is incremented in turn, and once the number of files reaches the set maximum, the file generated first is overwritten. A sketch of this storage scheme is given after rule b) below.
According to the above rules, the server stores events in a local storage directory as a collection of files whose names may indicate which CPU generated the events they store.
b) The generation of each event is time-ordered, and the server stores the events according to the sequence of the generation time of the events.
Events are stored in a file sequentially, in the order of their generation times, so the file's last-modification-time attribute reflects the event generation time of the last event stored in the file (some file systems record only a single modification time, which is the last modification time of the file). In conjunction with rule a), the last modification time of the previous file in a file group can be taken as the creation time of the current file. Likewise, the "creation time" of a file reflects the event generation time of the first event the file stores, and the "creation time" of the next file can be taken as the last modification time of the current file.
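The two rules above can be illustrated with the short sketch below: append_event implements the per-CPU file groups and size-based rotation of rule a), and file_time_periods shows how rule b) lets the time period of each file in a group be inferred from modification times. The file names, the sequence-number bookkeeping and the helper names are all illustrative assumptions, not details taken from the patent.

```python
import os

MAX_FILE_SIZE = 2 * 1024 * 1024    # 2 MB per file, as in the example above
MAX_FILES_PER_GROUP = 1000         # illustrative rotation depth

def append_event(base_dir: str, cpu_number: int, record: bytes) -> str:
    """Rule a): append an encoded event to the file group of the CPU that produced it,
    rotating to the next sequence number (and eventually overwriting the oldest file)
    once the current file reaches MAX_FILE_SIZE."""
    os.makedirs(base_dir, exist_ok=True)
    seq_path = os.path.join(base_dir, f"cpu{cpu_number}.seq")
    seq = int(open(seq_path).read()) if os.path.exists(seq_path) else 0
    current = os.path.join(base_dir, f"detail_cpu{cpu_number}_{seq:06d}.log")
    if os.path.exists(current) and os.path.getsize(current) >= MAX_FILE_SIZE:
        seq = (seq + 1) % MAX_FILES_PER_GROUP
        with open(seq_path, "w") as f:
            f.write(str(seq))
        current = os.path.join(base_dir, f"detail_cpu{cpu_number}_{seq:06d}.log")
        open(current, "w").close()          # reuse (overwrite) the rotated slot
    with open(current, "ab") as f:
        f.write(record)
    return current

def file_time_periods(file_group):
    """Rule b): for a file group ordered oldest-first, derive the time period covered
    by each file. A file's last-modification time bounds its newest event, and the
    previous file's last-modification time bounds its oldest event."""
    periods, previous_mtime = [], None
    for path in file_group:
        st = os.stat(path)
        start = previous_mtime if previous_mtime is not None else st.st_ctime
        periods.append((path, start, st.st_mtime))
        previous_mtime = st.st_mtime
    return periods
```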
In this example, the source server needs to perform the following processing:
a) the source server (e.g., in front-end server a of fig. 1) encodes the traceID generation time into the traceID when it generates the traceID. Thus, at link analysis, the server can decode its generation time from the traceID.
b) When the user request is completed, the source server generates an API event whose recorded information comprises the tracking identifier and the call duration of the user request. The API event may be a special meta-event, generated along with the detail event produced when the user request is completed, but it may also be generated separately. In this example, the source server picks out the timeout API events from the API events and stores them in a dedicated directory (that is, only timeout API events are stored in that directory), where a timeout API event is an API event whose call duration exceeds the corresponding timeout duration (the timeout duration set for that type of user request). API events themselves may also be saved under a dedicated API event directory to facilitate searching (that directory may or may not also contain the timeout API events).
c) If the search time window is to be narrowed using the call duration of the user request, the IP address of the source server is encoded into the traceID when the source server generates it; during link analysis, the global scheduler can then decode the IP address of the source server from the traceID and query the source server for the call duration of the user request.
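The encoding described in a) and c) might look like the sketch below, which packs the generation timestamp and the source IPv4 address into the traceID so that either can be decoded later without any lookup. The 16-byte layout, the hex-string form and the random suffix are illustrative assumptions; the patent only requires that the generation time, and optionally the source address, be recoverable from the traceID.

```python
import os
import socket
import struct
import time

def make_trace_id(source_ip: str) -> str:
    """Build a traceID embedding its generation time and the source server's IP."""
    ts_us = int(time.time() * 1_000_000)          # generation time in microseconds
    ip_packed = socket.inet_aton(source_ip)       # 4-byte IPv4 address
    raw = struct.pack("<Q4s4s", ts_us, ip_packed, os.urandom(4))
    return raw.hex()                              # 16 bytes -> 32 hex characters

def decode_trace_id(trace_id: str):
    """Recover (generation time in seconds, source IP) from a traceID."""
    ts_us, ip_packed, _ = struct.unpack("<Q4s4s", bytes.fromhex(trace_id))
    return ts_us / 1_000_000, socket.inet_ntoa(ip_packed)
```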
The special requirements of this example for the link tracking method for class II requests are as follows:
a) regardless of whether the search time window is narrowed using the call duration of the user request, the IP address of the source server is encoded into the traceID when the source server generates it, so that the global scheduler can decode from the traceID the IP address of the server that generated it and initiate the link query to that source server.
b) When the server processes the network communication requested by the user, a communication event is generated, the communication event comprises the IP address of the next-hop server to which the user request is sent, and the event can be stored in the communication event storage directory independently.
In this example, the communication event is stored in the exclusive directory of the communication event, and the timeout API event is stored in the exclusive directory of the timeout API event, which is independent from the storage directory of other events (but may be a sub-directory of the storage directory of other events), so that the search for these two types of special events can be accelerated. But the invention is not limited thereto.
Example two
This example relates to the link analysis process of a type I request such as a search request; the related link tracking process adopts the scheme of example one, and this example mainly focuses on the collection of event information during link analysis.
The general flow of the link analysis process of this example is shown in fig. 8 and includes:
(1) a user needs to check the link condition of a certain user request (hereinafter referred to as a request), and sends a link check request to a global scheduler, wherein the link check request carries the traceID of the request;
(2) the global scheduler sends a link query request carrying the traceID to all production servers in the distributed cluster; after receiving the request, the analyzer (equivalent to the link analysis module in the embodiment) on each server searches its local storage directories for all detail events related to the traceID and, if any are found, sends the information of the found detail events back to the global scheduler;
(3) the global scheduler collects the information of all detail events related to the traceID, performs link analysis processing on the information, and sends the final processing result back to the user.
The "server program embedded in the point" in the cluster production server in fig. 8 is used to complete the event generation and storage function, and corresponds to the link tracking module in the embodiment.
Specifically, the analyzer in the server performs a two-stage process, as shown in fig. 9. In the first stage the analyzer searches the local meta-event storage directory to see whether the server has generated any meta-event related to the traceID. If so, the second stage searches the local detail event storage directory for all detail events related to the traceID and returns the search result to the global scheduler; if the first stage finds no meta-event associated with the traceID, the analyzer exits directly and the second stage is skipped. Because the first stage searches the much smaller local meta-event storage directory, its cost and time are far lower than directly searching the detail event storage directory, which is tens of times larger.
This example takes two optimization schemes for the first phase.
First optimization scheme of first stage
The scheme does not search the whole meta-event storage directory when searching the traceID. The entire storage directory of each server may store several days of data, and user requests typically have a timeout that is not large (the timeout for each type of request is different, but is less than or equal to the maximum lifetime of the user request, e.g., 10 minutes). The server may decode the traceID generation time from the traceID so that the first stage may search for data in the 10 minute time range from the traceID time, i.e. the search time window used by the first stage, referred to herein as the second search time window.
When events (including meta-events and detail events) are stored during link tracking, a file group is selected according to the CPU number of the CPU that generated the event, and events are then written into the files of that file group in time order; when a file reaches the set maximum size (e.g. a user-configured limit), writing moves to the next file in the file group. As a result, the "last modification time" attribute of each file reflects the event generation time of the last event in the file; in other words, the time period covered by all events saved in a file can be inferred from the file's own last modification time and that of the previous file. Thus the first stage does not scan the entire directory, but only searches for meta-events containing the traceID in the set of files whose time period falls within the second search time window. With this optimization, the maximum time consumed by the first stage does not grow as the number of meta-events under the meta-event storage directory grows, as long as the per-server QPS is fixed.
The processing flow of the analyzer (equivalent to the link analysis module) on the server in the first stage (see fig. 10) includes:
the analyzer decodes the generation timestamp t of the traceID from the traceID's encoding;
the analyzer traverses the local meta-event storage directory and finds the set S of all files whose time period from creation to last modification falls within [t, t + 10 minutes];
If the set S is not empty, searching a meta-event related to the traceID in the file of the set, and recording the meta-event set as E; otherwise, directly returning an empty result;
the CPU number of each meta-event in the set E is recorded, and this information can be used to optimize the second phase later.
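A compact sketch of this first-stage flow is given below. For readability it assumes meta-events are stored as JSON lines (one object per line with "trace_id", "event_time" and "cpu" fields) rather than packed binary records, and it assumes the illustrative helpers from the earlier sketches (decode_trace_id, find_target_files) are defined in the same module; all of these are assumptions, not the patent's prescribed format.

```python
import json

MAX_LIFETIME = 10 * 60   # maximum life cycle of a user request, in seconds

def first_stage(meta_dir: str, trace_id: str, window_seconds: float = MAX_LIFETIME):
    """Search the meta-event directory for meta-events carrying the traceID inside
    the second search time window [t, t + window_seconds], where t is decoded
    from the traceID itself. Returns the matching meta-events (set E); their
    'event_time' and 'cpu' fields drive the second stage."""
    t, _ = decode_trace_id(trace_id)                      # see earlier sketch
    hits = []
    for path in find_target_files(meta_dir, t, t + window_seconds):
        with open(path) as f:
            for line in f:
                event = json.loads(line)
                if event["trace_id"] == trace_id:
                    hits.append(event)
    return hits
```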
Second optimization scheme of first stage
This optimization scheme accelerates the link analysis of timeout requests, which are a type of request that users care about greatly. The first optimization scheme uses the maximum life cycle of a user request, 10 minutes, as the duration of the second search time window; that is, the first stage searches the server's meta-event storage directory for a meta-event containing the traceID within the 10 minutes following the traceID generation time. However, not all kinds of requests (write requests, read requests, etc.) have a 10-minute timeout; this optimization instead uses the actual call duration of the user request (or its type-specific timeout duration), which is typically much shorter than 10 minutes. The source server only needs to generate a timeout API event, and store it in another dedicated directory, whenever the call duration recorded in an API event exceeds the corresponding timeout duration (which may be set by the user). Normally, the number of timeout requests is much smaller than the number of normal requests, so the timeout API event storage directory is at least an order of magnitude smaller than the normal meta-event storage directory.
In this scheme, the source server searches its local timeout API events for one containing the traceID and obtains the call duration of the request from it; if no such timeout API event is found, the ordinary API events may be searched instead, or the timeout duration corresponding to the request may be used as its call duration. The call duration of the request then replaces the original 10 minutes as the duration of the second search time window used by each server when searching meta-events.
The procedure of the first phase optimized for timeout requests (see fig. 11 for its interaction with the global scheduler and the origin server) includes:
the global scheduler decodes from the traceID the IP address of the source server that generated it, sends the traceID to that server to query the call duration of the request, and then waits for the analyzer to return the call duration t1 of the request;
the analyzer on the source server decodes, from the received traceID's encoding, the timestamp t at which the traceID was generated;
the analyzer on the source server searches, in the local timeout API event storage directory, the set of files whose time period from creation to last modification falls within [t, t + 10 minutes] for a timeout API event containing the traceID, where the time period from creation to last modification of a file, i.e. the time period of the event generation times of the events stored in the file, can be determined from the last modification time and/or the creation time of the file (or of the file and its adjacent files);
if such an event is found, the analyzer on the source server takes from it the call duration t1 of the request and sends t1 to the global scheduler; if not, the timeout duration corresponding to the request is taken as t1 and sent to the global scheduler;
the global scheduler waits for t1 from the analyzer on the source server; if t1 is received within a reasonable waiting time, it sets x to t1, otherwise it sets x to 10 minutes, and then sends x (used as the value y in the analyzer programs) to the analyzers on all servers together with the traceID;
after receiving the traceID and the corresponding value y sent by the global scheduler, the analyzer on each server finds, in its local meta-event storage directory, the set S of all files whose time period from creation to last modification falls within [t, t + y];
if the set S is not empty, meta-events related to the traceID are searched from the set S, and the searched meta-event set is marked as E; otherwise, directly returning an empty result;
the CPU number contained in each meta-event in the event set E is recorded, and this information can be used to optimize the following second stage.
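The source-server side of this call-duration query could look like the following sketch, which searches the timeout API event directory inside [t, t + 10 minutes] and falls back to the type-specific timeout duration when nothing is found. The JSON-lines storage format and the helper functions from the earlier sketches are illustrative assumptions.

```python
import json

def query_call_duration(timeout_api_dir: str, trace_id: str, default_timeout: float) -> float:
    """Return the call duration t1 for the request identified by trace_id.

    Searches the timeout API event directory inside [t, t + MAX_LIFETIME];
    if no timeout API event carries the traceID, the timeout duration
    configured for this request type is returned instead.
    """
    t, _ = decode_trace_id(trace_id)                      # see earlier sketch
    for path in find_target_files(timeout_api_dir, t, t + MAX_LIFETIME):
        with open(path) as f:
            for line in f:
                event = json.loads(line)
                if event["trace_id"] == trace_id:
                    return event["call_duration"]         # the true duration t1
    return default_timeout                                # fall back to the timeout
```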
If, in the first stage, the analyzer on a server finds the target traceID (namely the traceID of the user request whose link is to be analyzed) in its meta-event storage directory, the second-stage processing is started; otherwise the search process ends directly. The second stage requires searching the larger local detail event storage directory for all events related to the target traceID and sending the results to the global scheduler.
The second stage also has two optimizations. First, since each detail event is accompanied by a meta-event, the event times of all meta-events found on the server in the first stage determine the first search time window for detail events on that server, so the whole detail event storage directory need not be searched. The second optimization is related to the CPU number: because an event is stored in a file of the file group corresponding to the CPU that generated it, knowing which CPU generated a given detail event means only the files of that file group in the detail event storage directory need to be searched. A server usually has 24 or 32 CPUs, so finding a detail event now requires searching only the files generated by one CPU, which greatly reduces the search volume and improves search efficiency. Since a meta-event and its detail event are generated together, they are produced by the same CPU in the vast majority of cases. Therefore, the CPU numbers contained in the meta-events found on the server in the first stage can be used to search for detail events in the corresponding file groups of the detail event storage directory. In rare cases CPU scheduling occurs between a meta-event and its corresponding detail event, causing the search to miss that detail event, so the numbers of found meta-events and detail events disagree; in that case the missing detail events can be searched for in the other files of the detail event storage directory. With these optimizations, the second-stage search time does not increase as the detail event storage directory grows, as long as the per-server QPS is fixed.
The specific process of the second stage of this example (see fig. 12) includes:
(1) the analyzer on the server obtains, from the first stage, the event generation time and the CPU number of each meta-event associated with the target traceID, and determines the first search time window [t1, t2] and the CPU set X to be searched; the number n of meta-events found in the first stage is also counted;
(2) the analyzer finds, in the local detail event storage directory, the files whose time period from creation to last modification falls into [t1, t2], and selects from them the files under the file groups corresponding to the CPU set X to form a set S;
(3) if the set S is empty, jumping to (5); otherwise, executing (4);
(4) search all files of the set S for detail events related to the traceID to form a set E, and check whether the number m of events in the set E is less than n. If m equals n, the files in set S contain all the detail events; set E2 is set to empty and the flow jumps to (6). If m is less than n, the set E2 of detail events not yet found must be searched for, and (5) is executed;
(5) find, in the local detail event storage directory, the files (excluding the file groups already searched) whose time period from creation to last modification falls into [t1, t2], and search them for the detail events not yet found to form a set E2. Note that the files searched at this step include files generated by all CPUs;
(6) return the union of set E and set E2 as the search result set F to the global scheduler, ending the second stage.
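Steps (1)-(6) above can be condensed into the following sketch. It reuses the illustrative helpers from the earlier sketches (find_target_files, files_for_cpus) and the JSON-lines event format; the fallback pass covers the rare case where CPU scheduling separated a meta-event from its detail event.

```python
import json

def second_stage(detail_dir: str, trace_id: str, meta_hits: list) -> list:
    """Search for detail events related to trace_id, driven by the meta-events
    found in the first stage: their event times span [t1, t2] and their CPU set X
    selects which file groups to scan first."""
    t1 = min(e["event_time"] for e in meta_hits)
    t2 = max(e["event_time"] for e in meta_hits)
    cpu_set = {e["cpu"] for e in meta_hits}

    candidates = find_target_files(detail_dir, t1, t2)    # files covering [t1, t2]
    primary = files_for_cpus(candidates, cpu_set)         # file groups of CPU set X

    def scan(paths):
        found = []
        for path in paths:
            with open(path) as f:
                for line in f:
                    event = json.loads(line)
                    if event["trace_id"] == trace_id:
                        found.append(event)
        return found

    details = scan(primary)                               # set E
    if len(details) < len(meta_hits):                     # some detail events missed
        details += scan([p for p in candidates if p not in primary])   # set E2
    return details                                        # result set F = E plus E2
```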
Example three
This example relates to the link analysis process of type II requests, such as distributed write requests or distributed read requests; the related link tracking process adopts the scheme of example one relating to type II requests, and this example mainly focuses on the collection of event information during link analysis.
Overall process
Because the traceID embeds the IP address of the source server that originated it, the global scheduler can decode the IP address of the corresponding source server from a given traceID and send the traceID to that source server. The analyzer on the source server performs the two-stage process described for type I requests to search for local events associated with the traceID, finds which servers (next-hop servers) it forwarded the request carrying the traceID to, and sends a link query request carrying the traceID to these servers. The analyzers on these servers operate in the same manner, and so on until the last-hop server.
A type II request differs from a type I request in that it involves only a few servers, so this chained tracking scheme lets only the servers that actually processed the request participate in the link analysis, while most servers in the cluster are left undisturbed, minimizing the interference of link analysis with the production servers.
The link analysis flow for the type II request of the present example is shown in fig. 13, and includes:
(1) a user wants to check the link condition of a certain request, and sends a link check request for the request to a global scheduler, wherein the link check request carries the traceID of the request;
(2) the global scheduler analyzes the IP address of the source server generating the traceID, and sends a link query request to the source server, wherein the link query request carries the traceID.
(3) After receiving the traceID, the analyzer on the source server searches for an event associated with the traceID according to the two-stage processing procedure of the type I request, searches for the IP address of the next-hop server to which the traceID is sent from the communication event, and sends a link query request to the next-hop server, carrying the traceID. This chains up to the last hop server.
(4) All analyzers on the servers receiving the link query request send the searched information of the detail event related to the traceID back to the global scheduler;
(5) and the global scheduler collects all the events, analyzes and processes the events and sends the results to the user.
In addition to the two-stage process used for type I requests, the analyzer on the server in this example also performs a downstream tracing process of finding the next-hop servers and sending the traceID to them. As shown in fig. 14, the analyzer's two-stage process and downstream tracing process for type II requests include:
two-stage treatment process: the same as in example two;
and (3) a downstream tracing process: during two-stage processing, the analyzer searches whether the communication event storage directory contains communication events related to the traceID, if so, finds the IP addresses of all next-hop servers, and sends a link inquiry request to the next-hop servers, wherein the link inquiry request carries the traceID; otherwise, directly returning.
For type II requests, the two-stage process has, besides the same advantages as for type I requests, the further benefit of greatly reducing the CPU and disk I/O consumption of the cluster, because only a few servers are involved: only the analyzers on those servers run the second stage, while the analyzers on most servers in the cluster never reach it. The scheme therefore offers even greater advantages for such requests.
Since the two-stage process for type II requests is the same as that for type I requests, only the flow of downstream tracing needs to be described in detail here:
The downstream tracing process searches the communication event storage directory instead of the meta-event storage directory, so its optimization scheme and effect are the same as those of example two. As shown in fig. 15, it includes:
(1) decode the generation timestamp t from the traceID encoding;
(2) traverse the local communication event storage directory and find the set S of all files whose time period from creation to last modification falls into [t, t + 10 minutes];
(3) If the set S is not empty, searching a communication event related to the traceID in the file of the set, and recording the communication event set as E; otherwise, directly returning a null result
(4) If the set E is not empty, recording the IP address of the next hop server contained in each communication event in the set E, and sending a link inquiry request to the next hop server, wherein the link inquiry request carries the traceID; otherwise, directly returning an empty result;
(5) the downstream tracing process is ended.
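The downstream tracing flow above might be sketched as follows, again using the JSON-lines format and the helpers from the earlier sketches; send_link_query stands in for whatever RPC the cluster uses to forward a link query and is an assumption of this sketch.

```python
import json

def downstream_trace(comm_dir: str, trace_id: str, send_link_query) -> set:
    """Scan the communication-event directory inside [t, t + MAX_LIFETIME] for
    events carrying the traceID and forward a link query, carrying the traceID,
    to every next-hop server address found."""
    t, _ = decode_trace_id(trace_id)                      # see earlier sketch
    next_hops = set()
    for path in find_target_files(comm_dir, t, t + MAX_LIFETIME):
        with open(path) as f:
            for line in f:
                event = json.loads(line)
                if event["trace_id"] == trace_id:
                    next_hops.add(event["next_hop_ip"])
    for ip in next_hops:
        send_link_query(ip, trace_id)                     # carry the traceID downstream
    return next_hops
```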
The above-described embodiments and examples enable fully localized storage and computation, eliminating the central repository and large databases in classical distributed link trace analysis systems. In addition, the method has at least the following characteristics:
each detail event is accompanied by a meta-event, which is used to optimize the local search for a traceID, including by exploiting the meta-event's generation time and the CPU number that generated it;
timeout API events are generated to optimize the search process for timeout requests;
communication events containing the target server IP address are added to optimize the search process for type II requests, so that only the few servers that actually processed the request participate in the link analysis while most servers in the cluster are undisturbed, minimizing the interference of link analysis with the production servers.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (45)
1. A distributed link tracking method, comprising:
in the process of processing a user request, a server in a distributed cluster generates a meta-event for searching a detail event along with the generation of the detail event, wherein the meta-event comprises a tracking identifier and information of an event generation time, and the data volume is less than that of the corresponding detail event;
the server stores the generated detail event and meta event separately in local.
2. The method of claim 1, wherein:
the tracking identification comprises information of the generation moment of the tracking identification; or
The tracking identification comprises information of the generation moment of the tracking identification and address information of the source server.
3. The method of claim 1, wherein:
the server separately saves the detail event and the meta event in the local, and comprises the following steps: and sequentially storing the detail events and the meta events in files under respective directories according to the sequence of the event generation time, and transferring to the next file for storage after one file reaches the set maximum size.
4. A method as claimed in claim 1, 2 or 3, characterized by:
the meta-event also contains identification information of a processor generating the meta-event;
the server saves the detail event locally, and comprises the following steps: and storing the detail events generated by the same processor in a file group corresponding to the processor, wherein different processors correspond to different file groups.
5. A method as claimed in claim 1, 2 or 3, characterized by:
the method further comprises the following steps:
and the source server of the user request generates and stores an Application Programming Interface (API) event when the user request is completed, wherein the API event comprises the tracking identifier and the calling duration information of the user request.
6. The method of claim 5, wherein:
after the source server generates the API event, the method further includes: and storing the overtime API events in the API events in a special directory of the overtime API events, wherein the overtime API events refer to API events of which the calling duration requested by a user exceeds the corresponding overtime duration.
7. The method of claim 1 or 2 or 3 or 6, wherein:
the user request is a distributed read request or a distributed write request;
the method further comprises the following steps: and the server generates a communication event when performing network communication processing for sending the user request to a next hop server, wherein the communication event comprises the tracking identifier and address information of the next hop server to which the user request is sent.
8. A server in a distributed cluster, comprising a link tracking module, wherein the link tracking module comprises:
an event generating unit configured to: generating a meta event for searching a detail event while generating the detail event in processing a user request; the meta-event comprises a tracking identifier and information of an event generation time, and the data volume is less than that of the corresponding detail event;
an event storage unit configured to: and storing the generated detail event and meta event separately in a local place.
9. The server of claim 8, wherein:
the link tracking module further comprises: an identification generation unit configured to: generating a tracking identifier for a user request, and encoding information of the generation moment of the tracking identifier into the tracking identifier, or encoding the information of the generation moment of the tracking identifier and the address information of the server into the tracking identifier.
10. The server of claim 8, wherein:
the event storage unit separately stores the detail event and the meta event in local, and comprises: and sequentially storing the detail events and the meta events in files under respective directories according to the sequence of the event generation time, and transferring to the next file for storage after one file reaches the set maximum size.
11. The server according to claim 8, 9 or 10, wherein:
the meta-event generated by the event generating unit includes identification information of a processor generating the meta-event;
the event storage unit stores the detail event locally, and comprises: and storing the detail events generated by the same processor in a file group corresponding to the processor, wherein different processors correspond to different file groups.
12. The server according to claim 8, 9 or 10, wherein:
the event generating unit is further configured to generate an Application Programming Interface (API) event when the user request is completed, where the API event includes the tracking identifier and the calling duration information of the user request;
the event storage unit is further configured to store the API event.
13. The server of claim 12, wherein:
after the event storage unit generates the API event, the method further includes: and storing the overtime API events in the API events in a special directory of the overtime API events, wherein the overtime API events refer to API events of which the calling duration requested by a user exceeds the overtime duration.
14. The server according to claim 8 or 9 or 10 or 12, wherein:
the event generating unit is further configured to generate a communication event when performing network communication processing for sending the user request to a next hop server, where the communication event includes the tracking identifier and address information of the next hop server to which the user request is sent; the user request is a user request for a distributed read or a distributed write.
15. A server in a distributed cluster, comprising a processor and a memory,
the memory, configured to: saving the program code;
the processor is configured to: reading the program code and performing the following link trace processing: in the process of processing a user request, generating a meta-event for searching a detail event when the detail event is generated, wherein the meta-event comprises a tracking identifier and information of an event generation time, and the data volume is less than that of the corresponding detail event; and the generated detail event and the meta event are separately stored in the local.
16. A distributed link analysis method, comprising:
when a global scheduler in a distributed cluster needs to perform link analysis on a user request, sending a link query request to a server in the distributed cluster, wherein the link query request carries a tracking identifier of the user request;
the global scheduler receives event information which is returned by the server after local search and is related to the tracking identification;
and the global scheduler performs link analysis on the user request based on the received event information related to the tracking identification.
17. The method of claim 16, wherein:
the tracking identification comprises information of the generation moment of the tracking identification, and the link inquiry request also carries the maximum life cycle information requested by a user; or
The tracking identifier comprises information of the generation moment of the tracking identifier, and the link query request also carries the calling duration information of the user request, which is obtained by the global scheduler querying a source server.
18. The method of claim 16 or 17, wherein:
the user request is a search request; the global scheduler sends a link query request to the servers in the distributed cluster, including: sending the link query request to all servers in the distributed cluster; or
The user request is a distributed read request or a distributed write request, and the global scheduler sends a link query request to a server in the distributed cluster, including: and sending the link inquiry request to an origin server requested by the user.
19. A global scheduler in a distributed cluster, comprising:
a link query module configured to: when a user request needs to be subjected to link analysis, a link query request is sent to a server in the distributed cluster, and a tracking identifier of the user request is carried; receiving event information which is returned by the server after local search and is related to the tracking identification;
a link analysis module configured to: and performing link analysis on the user request based on the received event information related to the tracking identification.
20. The global scheduler of claim 19, wherein:
the link query request sent by the link query module also carries the maximum life cycle information requested by a user, and the tracking identifier comprises the information of the generation moment of the tracking identifier; or
The link query request sent by the link query module also carries the call duration information of the user request obtained by the global scheduler querying a source server, and the tracking identifier contains the information of the generation moment of the tracking identifier.
21. The global scheduler of claim 19 or 20, wherein:
the user request is a search request, and the link query module sends a link query request to a server in the distributed cluster, including: sending the link query request to all servers in the distributed cluster; or
The user request is a distributed read request or a distributed write request, and the link query module sends a link query request to a server in the distributed cluster, including: and sending the link query request to the source server requested by the user in the distributed cluster.
22. A global scheduler in a distributed cluster, comprising a processor and a memory,
the memory, configured to: saving the program code;
the processor is configured to: reading the program code and performing the following link analysis processing:
when a user request needs to be subjected to link analysis, a link query request is sent to a server in the distributed cluster, and a tracking identifier of the user request is carried;
receiving event information which is returned by the server after local search and is related to the tracking identification;
and performing link analysis on the user request based on the received event information related to the tracking identification.
23. A distributed link analysis method, comprising:
a server in a distributed cluster receives a link query request sent by a global scheduler, wherein the link query request carries a tracking identifier of a user request;
the server locally searches meta-events related to the tracking identification, determines a first search time window of detail events according to event generation time information contained in the searched meta-events, and locally searches the detail events related to the tracking identification according to the first search time window;
and the server returns the searched information of the detail event to the global scheduler.
24. The method of claim 23, wherein:
the meta-events are sequentially stored in the files according to the sequence of the event generation time, and when one file reaches a set maximum size, the next file is transferred for storage;
the server locally searches for meta-events related to the tracking identity, and the meta-events comprise: searching a second target file where the meta-event with the event generating time falling into a second searching time window is located under the storage directory of the meta-event, and searching the meta-event related to the tracking identification in the second target file;
the starting time of the second search time window is the generation time of the tracking identifier, the duration is the calling duration of the user request or the maximum life cycle of the user request, and the information of the generation time of the tracking identifier is carried in the link query request.
25. The method of claim 24, wherein:
the server searching the second target file comprises: determining the time period of the event generation time of the meta-event stored in the file according to the creation time and/or the last modification time of the file in the meta-event storage directory; and searching the files of which the time period falls into the second searching time window, wherein the searched files are the second target files.
26. The method of claim 24, wherein:
the information of the calling duration is carried in a link query request sent by the global scheduler;
the method further comprises the following steps:
the source server of the user request receives a calling duration query request of the global scheduler, wherein the calling duration query request carries the tracking identifier;
the source server locally searches an Application Programming Interface (API) event or an overtime API event related to the tracking identifier, acquires the information of the calling duration from the searched API event or overtime API event and returns the information to the global scheduler;
the API event comprises tracking identification of a user request and information of calling duration, and the overtime API event refers to an API event of which the calling duration of the user request exceeds the corresponding overtime duration.
27. The method of claim 26, wherein:
after locally searching for a timeout API event associated with the tracking identity, the source server further includes: and if the overtime API event related to the tracking identification is not searched, returning the information of the overtime duration corresponding to the user request as the information of the calling duration to the global scheduler.
28. The method of claim 26 or 27, wherein:
the source server locally searches for an API event or a timeout API event related to the tracking identification, and the method comprises the following steps: searching a third target file where an API event or an overtime API event of which the event generating time falls into a third searching time window is located, and searching the API event or the overtime API event related to the tracking identifier in the third target file, wherein the starting time of the third searching time window is the generating time of the tracking identifier, and the duration is the maximum life cycle requested by a user.
29. The method of any one of claims 23-27, wherein:
the detail events are sequentially stored in the files according to the sequence of the event generation time, and when one file reaches a set maximum size, the next file is transferred for storage;
the server determines a first search time window for the detail event, and locally searches the detail event related to the tracking identification according to the first search time window, wherein the method comprises the following steps:
the server determines [ t1, t2] as the first search time window, wherein t1, t2 are the earliest time and the latest time of the event generation times contained in the searched meta-event respectively;
and the server searches a first target file where the detail event with the event generation time falling into the first search time window is located in a detail event storage directory, and then searches the detail event related to the tracking identification in the first target file.
30. The method of claim 29, wherein:
the server searches the first target file under the detail event storage directory, and the method comprises the following steps: and determining the time period of the event generation time of the detail event stored in the file according to the creation time and/or the last modification time of the file in the detail event storage directory, then searching the file of which the time period falls into the first search time window, wherein the searched file is the first target file.
31. The method of claim 30, wherein:
the meta-event comprises identification information of a processor generating the meta-event, the detail event generated by the same processor is stored in a file group corresponding to the processor, and different processors correspond to different file groups;
the server searches the first target file in a detail event storage directory, and then searches the first target file for detail events related to the tracking identification, wherein the method comprises the following steps:
the server determines a file group corresponding to the processor according to identifier information of the processor contained in the searched meta-event, searches the first target file in the corresponding file group, and then searches the detail event related to the tracking identifier in the first target file; or
And the server determines a file group corresponding to the processor according to the identifier information of the processor contained in the searched meta-event, selects files belonging to the corresponding file group from the first target file after searching the first target file, and then searches the selected files for the detail event related to the tracking identifier.
32. The method of claim 31, wherein:
after the server searches for the detail event related to the tracking identifier, the method further includes: counting the number M1 of detail events related to the tracking identifier, and if M1 is smaller than the number M2 of meta-events related to the tracking identifier found locally, searching other file groups in the detail event storage directory for detail events related to the tracking identifier.
33. The method of any of claims 23-27, 30-32, wherein:
the link query request is a link query request of a distributed read request or a distributed write request of a user;
after receiving the link query request, the server further includes:
the server inquires a communication event which is locally stored and is related to the tracking identification, wherein the communication event comprises address information of a next hop server to which the user request is sent;
if the server inquires the communication event, a link inquiry request is sent to the next hop server according to the address information in the communication event, and the link inquiry request carries the tracking identifier of the user request;
and the server receives the information of the detail event related to the tracking identification returned by the next-hop server and returns the information to the global scheduler.
34. A server in a distributed cluster, comprising a link analysis module, wherein the link analysis module comprises:
a first query interface unit configured to: receive a link query request sent by a global scheduler, wherein the link query request carries a tracking identifier of a user request; and return the information of the searched detail events to the global scheduler;
a meta-event search unit configured to: locally search for meta-events related to the tracking identifier;
a detail event search unit configured to: determine a first search time window for detail events according to event generation time information contained in the meta-events found by the meta-event search unit, and locally search for detail events related to the tracking identifier according to the first search time window.
35. The server according to claim 34, wherein:
the meta-event search unit locally searching for meta-events related to the tracking identifier comprises: searching, under a meta-event storage directory, for a second target file in which meta-events with event generation times falling within a second search time window are located, and searching the second target file for meta-events related to the tracking identifier; wherein the starting time of the second search time window is the generation time of the tracking identifier, its duration is the call duration of the user request or the maximum life cycle of the user request, and the information on the generation time of the tracking identifier is carried in the link query request.
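The second search time window reduces to a small calculation. The sketch below assumes the tracking identifier's generation time and the two durations are available as plain seconds, which is an illustrative simplification.

```python
# Second search time window: starts at the tracking identifier's generation
# time; lasts the call duration if known, otherwise the maximum life cycle.
from typing import Optional, Tuple

def second_search_window(trace_created_at: float,
                         call_duration_s: Optional[float],
                         max_lifecycle_s: float) -> Tuple[float, float]:
    length = call_duration_s if call_duration_s is not None else max_lifecycle_s
    return trace_created_at, trace_created_at + length

print(second_search_window(1700000000.0, 3.5, 30.0))   # (1700000000.0, 1700000003.5)
print(second_search_window(1700000000.0, None, 30.0))  # (1700000000.0, 1700000030.0)
```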
36. The server according to claim 35, wherein:
the meta-event search unit searching for the second target file comprises: determining, according to the creation time and/or the last modification time of a file in the meta-event storage directory, the time period covered by the event generation times of the meta-events stored in the file; and searching for a file whose time period falls within the second search time window, the searched file being the second target file; wherein the meta-events are stored in the files sequentially in order of event generation time, and when one file reaches a set maximum size, storage moves on to the next file.
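A minimal sketch of this storage rule: events are appended in generation-time order and writing rolls over to a new file once the current one reaches a configured maximum size. The file-naming pattern and size limit below are assumptions, not part of the claim.

```python
# Append events in order and roll to a new file at a maximum size.
import os

class RollingEventWriter:
    def __init__(self, directory: str, max_bytes: int = 4 * 1024 * 1024):
        self.directory = directory
        self.max_bytes = max_bytes
        self.index = 0
        os.makedirs(directory, exist_ok=True)

    def _current_path(self) -> str:
        return os.path.join(self.directory, f"events-{self.index:06d}.log")

    def append(self, event_line: str) -> None:
        path = self._current_path()
        if os.path.exists(path) and os.path.getsize(path) >= self.max_bytes:
            self.index += 1            # roll over to the next file
            path = self._current_path()
        with open(path, "a", encoding="utf-8") as f:
            f.write(event_line.rstrip("\n") + "\n")
```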
37. The server according to claim 35, wherein:
the link analysis module further comprises:
a call duration storage unit configured to: store Application Programming Interface (API) events or timeout API events, wherein an API event comprises the tracking identifier of a user request and information on the call duration, and a timeout API event is an API event in which the call duration of the user request exceeds the corresponding timeout duration;
a call duration search unit configured to: receive a call duration query request sent by the global scheduler, locally search for an API event or timeout API event related to the tracking identifier carried in the call duration query request, acquire the call duration information of the user request from the found API event or timeout API event, and return the call duration information to the global scheduler.
38. The server according to claim 37, wherein:
after locally searching for the timeout API event related to the tracking identifier, the call duration search unit is further configured to: if no such event is found, return the information of the timeout duration corresponding to the user request to the global scheduler as the information of the call duration.
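The two preceding claims together describe a small lookup, sketched below under the assumption that each stored (timeout) API event carries the tracking identifier and the measured call duration; the record layout is hypothetical, and only the fallback to the configured timeout duration is taken from the claim text as translated.

```python
# Look up the call duration for one tracking identifier.
from dataclasses import dataclass
from typing import Iterable

@dataclass
class ApiEvent:                    # hypothetical record layout
    trace_id: str
    call_duration_s: float

def query_call_duration(events: Iterable[ApiEvent], trace_id: str,
                        request_timeout_s: float) -> float:
    # Take the call duration from a matching API / timeout API event.
    for e in events:
        if e.trace_id == trace_id:
            return e.call_duration_s
    # When no matching timeout API event is found, report the timeout
    # duration configured for the user request instead.
    return request_timeout_s
```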
39. The server according to claim 37 or 38, wherein:
the call duration search unit locally searching for the API event or timeout API event related to the tracking identifier comprises: searching for a third target file in which API events or timeout API events with event generation times falling within a third search time window are located, and searching the third target file for the API event or timeout API event related to the tracking identifier; wherein the starting time of the third search time window is the generation time of the tracking identifier and its duration is the maximum life cycle of the user request.
40. A server according to any one of claims 34-38, wherein:
the detail event search unit includes:
a time window subunit configured to: determine [t1, t2] as the first search time window, wherein t1 and t2 are respectively the earliest and latest event generation times contained in the searched meta-events;
a search subunit configured to: search, under a detail event storage directory, for a first target file in which detail events with event generation times falling within the first search time window are located, and search the first target file for detail events related to the tracking identifier.
41. The server according to claim 40, wherein:
the search subunit searching for the first target file under the detail event storage directory comprises: determining, according to the creation time and/or the last modification time of a file in the detail event storage directory, the time period covered by the event generation times of the detail events stored in the file, and searching for a file whose time period falls within the first search time window, the searched file being the first target file; wherein the detail events are stored in the files sequentially in order of event generation time, and when one file reaches a set maximum size, storage moves on to the next file.
42. The server according to claim 41, wherein:
the meta-event comprises identification information of the processor that generated the meta-event, detail events generated by the same processor are stored in a file group corresponding to that processor, and different processors correspond to different file groups;
the search subunit searching for the first target file in the detail event storage directory and then searching the first target file for detail events related to the tracking identifier comprises:
determining the file group corresponding to the processor according to the identification information of the processor contained in the searched meta-event, searching for the first target file within the corresponding file group, and searching the first target file for detail events related to the tracking identifier; or
determining the file group corresponding to the processor according to the identification information of the processor contained in the searched meta-event, selecting, after searching for the first target files, the files belonging to the corresponding file group from among them, and searching the selected files for detail events related to the tracking identifier.
43. The server according to claim 42, wherein:
the search subunit is further configured to: after searching for the detail events related to the tracking identifier, count the number M1 of detail events found that are related to the tracking identifier, and if M1 is smaller than the number M2 of meta-events related to the tracking identifier found by the meta-event search unit, then search other file groups in the detail event storage directory for detail events related to the tracking identifier.
44. The server according to any one of claims 34-38, 41-43, wherein:
the link query request is a link query request for a distributed read request or a distributed write request of a user;
the link analysis module further comprises a second query interface unit, the second query interface unit comprising:
an address acquisition subunit configured to: after the link query request is received, locally query a communication event related to the tracking identifier, and, if the communication event is found, acquire the address information of the next-hop server to which the user request was sent;
a link query subunit configured to: send a link query request to the next-hop server according to the address information of the next-hop server, the link query request carrying the tracking identifier of the user request;
an information transfer subunit configured to: receive the information of the detail events related to the tracking identifier returned by the next-hop server and return the information to the global scheduler.
45. A server in a distributed cluster, comprising a processor and a memory,
the memory is configured to store program code;
the processor is configured to read the program code and perform the following operations:
receiving a link query request sent by a global scheduler, wherein the link query request carries a tracking identifier of a user request;
locally searching for meta-events related to the tracking identifier, determining a first search time window for detail events according to event generation time information contained in the searched meta-events, and locally searching for detail events related to the tracking identifier according to the first search time window;
and returning the information of the searched detail events to the global scheduler.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611140248.3A CN108228432A (en) | 2016-12-12 | 2016-12-12 | A kind of distributed link tracking, analysis method and server, global scheduler |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108228432A true CN108228432A (en) | 2018-06-29 |
Family
ID=62637860
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611140248.3A Pending CN108228432A (en) | 2016-12-12 | 2016-12-12 | A kind of distributed link tracking, analysis method and server, global scheduler |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108228432A (en) |
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020077988A1 (en) * | 2000-12-19 | 2002-06-20 | Sasaki Gary D. | Distributing digital content |
US20020087885A1 (en) * | 2001-01-03 | 2002-07-04 | Vidius Inc. | Method and application for a reactive defense against illegal distribution of multimedia content in file sharing networks |
US7512954B2 (en) * | 2002-07-29 | 2009-03-31 | Oracle International Corporation | Method and mechanism for debugging a series of related events within a computer system |
US20070209498A1 (en) * | 2003-12-18 | 2007-09-13 | Ulf Lindgren | Midi Encoding and Decoding |
CN101080710A (en) * | 2004-08-24 | 2007-11-28 | 塞门铁克操作公司 | Image data storage device write time mapping |
CN1936861A (en) * | 2005-09-23 | 2007-03-28 | 明基电通股份有限公司 | Programe execution tracing method and system and memory medium capable of reading by computer |
CN101454756A (en) * | 2006-06-16 | 2009-06-10 | 国际商业机器公司 | Tracking discrete elements of distributed transactions |
CN103247007A (en) * | 2012-07-09 | 2013-08-14 | 中国科学院沈阳自动化研究所 | Method for production process tracking based on production event |
CN103684898A (en) * | 2012-09-14 | 2014-03-26 | 阿里巴巴集团控股有限公司 | Method and device for monitoring operation of user request in distributed system |
CN103838668A (en) * | 2012-11-27 | 2014-06-04 | 国际商业机器公司 | Associating energy consumption with a virtual machine |
US9436693B1 (en) * | 2013-08-01 | 2016-09-06 | Emc Corporation | Dynamic network access of snapshotted versions of a clustered file system |
CN104346267A (en) * | 2013-08-08 | 2015-02-11 | 腾讯科技(深圳)有限公司 | Method and device for generating program bug positioning information |
CN105224445A (en) * | 2015-10-28 | 2016-01-06 | 北京汇商融通信息技术有限公司 | Distributed tracking system |
Non-Patent Citations (2)
Title |
---|
Li Shun et al.: "国家环境数据共享与服务体系研究与实践" (Research and Practice of the National Environmental Data Sharing and Service System), 31 December 2013 *
Hu Changping et al.: "信息资源管理研究进展" (Research Progress in Information Resource Management), 31 July 2008 *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109359094A (en) * | 2018-08-03 | 2019-02-19 | 挖财网络技术有限公司 | A kind of full link tracing method and device of distributed system journal |
CN109359094B (en) * | 2018-08-03 | 2021-04-16 | 挖财网络技术有限公司 | Distributed system log full-link tracking method and device |
CN111209303A (en) * | 2018-11-21 | 2020-05-29 | 核桃运算股份有限公司 | Data tracking device, method and computer storage medium thereof |
CN111385122A (en) * | 2018-12-29 | 2020-07-07 | 广州市百果园信息技术有限公司 | Distributed system link tracking method and device, computer equipment and storage medium |
CN111385122B (en) * | 2018-12-29 | 2023-06-16 | 广州市百果园信息技术有限公司 | Distributed system link tracking method, device, computer equipment and storage medium |
CN110245035A (en) * | 2019-05-20 | 2019-09-17 | 平安普惠企业管理有限公司 | A kind of link trace method and device |
WO2020233015A1 (en) * | 2019-05-20 | 2020-11-26 | 平安普惠企业管理有限公司 | Link tracking method and apparatus |
JP2022540426A (en) * | 2019-07-29 | 2022-09-15 | 日本電気株式会社 | Estimation device, estimation method, and program |
JP7251683B2 (en) | 2019-07-29 | 2023-04-04 | 日本電気株式会社 | Estimation device, estimation method, and program |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11860874B2 (en) | Multi-partitioning data for combination operations | |
US11720537B2 (en) | Bucket merging for a data intake and query system using size thresholds | |
US11663212B2 (en) | Identifying configuration parameters for a query using a metadata catalog | |
US11392416B2 (en) | Automated reconfiguration of real time data stream processing | |
US10447772B2 (en) | Managed function execution for processing data streams in real time | |
US11151137B2 (en) | Multi-partition operation in combination operations | |
US11657057B2 (en) | Revising catalog metadata based on parsing queries | |
CN108228322B (en) | Distributed link tracking and analyzing method, server and global scheduler | |
US11775501B2 (en) | Trace and span sampling and analysis for instrumented software | |
US11567993B1 (en) | Copying buckets from a remote shared storage system to memory associated with a search node for query execution | |
US12079175B2 (en) | Streaming synthesis of distributed traces from machine logs | |
US11562023B1 (en) | Merging buckets in a data intake and query system | |
US9990385B2 (en) | Method and system for collecting and analyzing time-series data | |
US11620336B1 (en) | Managing and storing buckets to a remote shared storage system based on a collective bucket size | |
US11615082B1 (en) | Using a data store and message queue to ingest data for a data intake and query system | |
US11663219B1 (en) | Determining a set of parameter values for a processing pipeline | |
US11392578B1 (en) | Automatically generating metadata for a metadata catalog based on detected changes to the metadata catalog | |
US11966797B2 (en) | Indexing data at a data intake and query system based on a node capacity threshold | |
US12019634B1 (en) | Reassigning a processing node from downloading to searching a data group | |
US11687487B1 (en) | Text files updates to an active processing pipeline | |
CN110928851B (en) | Method, device and equipment for processing log information and storage medium | |
CN108228432A (en) | A kind of distributed link tracking, analysis method and server, global scheduler | |
US11892976B2 (en) | Enhanced search performance using data model summaries stored in a remote data store | |
US12013895B2 (en) | Processing data using containerized nodes in a containerized scalable environment | |
US11789950B1 (en) | Dynamic storage and deferred analysis of data stream events |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||