This section introduces the detailed design and implementation of the geographic data-sharing and computing framework.
3.1. Design of the Data Service Container
In terms of technical implementation, a data service container is a lightweight hosting server for geographic data services, whose design goal is to provide data-access services to users through a series of common interfaces for heterogeneous geographic data resources.
3.1.1. UDX Model
The UDX model was proposed for the structured description of multi-source heterogeneous geographic data [29]. UDX represents the content information of geographic data through the combination of data nodes, as shown in Figure 2. Each UDX node represents a specific data type, such as the “list” type, “float/real” type, “int” type, and so on. By organizing these nodes according to certain logic, arbitrary data content can be expressed. In Figure 2, the pollutant flux data stored in Excel are represented by two nested “DTKT_LIST” nodes.
As can be seen from Figure 2, unlike a specific data format, the UDX model is a descriptive model: it is only responsible for describing the content of the data, not how the data are organized and stored. In this way, users can obtain the data they want based on UDX. For example, the pollutant data of “Yuxi River” in January can be obtained directly by accessing the corresponding UDX data node, without reading the Excel file and extracting the required data from it.
In addition, in the traditional data-sharing process, after users receive the raw data, they often need to configure a corresponding development environment to process and operate the data to meet their specific data needs; for example, reading and writing Excel files requires downloading the corresponding development libraries and configuring the development environment. This process is often cumbersome and time consuming, and the data-processing methods developed in this way are difficult to share with other users. However, if users obtain UDX data, they can process the data with the read and write interface provided by UDX. This is much easier than using the underlying data read and write interfaces, and UDX-based data-processing methods are also easier to share with other users.
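To make this contrast concrete, the sketch below shows the raw-data route only: reading pollutant flux values for one site and one month directly from an Excel file with openpyxl. The file name, sheet layout, and column meanings are assumptions made purely for illustration; with a UDX description view, the same request would instead address the named data node through the UDX read interface, without knowing the Excel layout at all.

```python
# A minimal sketch of the "raw data" route: reading pollutant flux values
# for one site and one month directly from an Excel file with openpyxl.
# The file name, sheet layout, and column meanings are illustrative
# assumptions, not the actual data set used in the paper.
from openpyxl import load_workbook

def read_flux_from_excel(xlsx_path, site_name, month):
    """Scan the sheet row by row and collect flux values for one site/month."""
    workbook = load_workbook(xlsx_path, read_only=True)
    sheet = workbook.active
    values = []
    # Assumed layout: column A = site, column B = month, column C = flux.
    for site, mon, flux in sheet.iter_rows(min_row=2, max_col=3, values_only=True):
        if site == site_name and mon == month:
            values.append(flux)
    return values

# With a UDX description view, the same request would simply address the
# "Yuxi River" node for January through the UDX read interface.
# print(read_flux_from_excel("pollutant_flux.xlsx", "Yuxi River", 1))
```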
Therefore, for data sharing based on UDX, data providers, data users, and data processors will be separated. The different roles only need to focus on their responsibilities: the data provider only needs to be responsible for describing what data they share, the data user only needs to focus on how the data are processed to meet their data needs, and the data processor only needs to focus on how to implement UDX-based data processing and then share the processing method with the other two roles.
The construction of data services in this paper is also based on a UDX model, aiming to provide users with a general view of data operation, without paying attention to the details of the organization and storage of the underlying data, thus helping to improve the efficiency of data sharing.
3.1.2. Generation of Data Service
From the perspective of data use, in most cases, when users receive the data, they need to carry out specific data-processing operations to obtain the result data that meet their application requirements. Therefore, the data service proposed in this paper includes both a data resource service and data-processing service, as shown in Figure 3.
- (1) Data resource service
The raw data are standardized through a UDX model, and then published as data services; such data services are called data resource services. As shown in Figure 2, if the pollutant flux data are published as a service, it is a data resource service, and users can directly access UDX nodes to obtain corresponding data resources.
Generally, geographic data come from a wide range of sources and are stored in a variety of ways, such as data stored in files, data accessed through API interfaces, and data stored in spatial databases. However, no matter how the data are organized and stored, there are always relevant data read and write interfaces that can retrieve the data content from these original data sources. For example, common data formats such as GeoTIFF, Shapefile, and NetCDF (Network Common Data Form) have associated read–write libraries such as GDAL, GeoPandas, and the NetCDF read–write interface. For user-defined data formats, users can also write relevant code to read and write the data. After the data content information is obtained, the UDX API can be used to map it into UDX data nodes, and these data nodes can be organized according to certain logic to form a common data view that users can easily understand. At this point, the data resource service can be published, and the common data view serves as the data content that the data resource service shares externally. When users access the data resource service, they can understand the data content from the UDX description view and access the corresponding data nodes according to their own requirements. The service backend then invokes the UDX API to read the relevant data nodes and returns their content to the user.
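As an illustration of this mapping step, the following sketch maps simple (site, month, flux) records into a tree of UDX-style nodes that could serve as the common data view of a data resource service. The node class, kernel-type names, and method names are simplified stand-ins written for this example, not the actual UDX API.

```python
# A minimal, self-contained sketch of the data-mapping step behind a data
# resource service: raw records are mapped into a tree of UDX-style nodes
# forming the common data view exposed to users. The node class below is a
# simplified stand-in written for illustration, not the actual UDX API.
from dataclasses import dataclass, field

@dataclass
class UdxNode:
    name: str                       # node name, e.g., a site name
    kernel_type: str                # e.g., "DTKT_LIST", "DTKT_REAL"
    value: object = None            # payload for leaf nodes
    children: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child

def map_flux_records_to_udx(records):
    """Map (site, month, flux) records into nested list nodes, one per site."""
    root = UdxNode("PollutantFlux", "DTKT_LIST")
    site_nodes = {}
    for site, month, flux in records:
        if site not in site_nodes:
            site_nodes[site] = root.add(UdxNode(site, "DTKT_LIST"))
        site_nodes[site].add(UdxNode(f"month_{month}", "DTKT_REAL", value=flux))
    return root

# The published data resource service would expose this node tree; a user
# request for one node returns only that node's content.
view = map_flux_records_to_udx([("Yuxi River", 1, 0.82), ("Yuxi River", 2, 0.77)])
print([child.name for child in view.children])
```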
- (2) Data-processing service
When users need to obtain some specific data and the existing data resource service cannot meet their needs, they can realize data customization by invoking a data-processing service. For example, the data resource service that publishes the pollutant data in Figure 2 only provides the original pollutant data; if a user needs statistics such as the mean, variance, and standard deviation of the pollutant flux at each site, the data resource service cannot provide them directly. In this case, it is necessary to call a data-processing service with a statistical function, which performs the numerical statistics by reading the data content from the data resource service and then returns the result to the user.
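A data-processing service of this kind could be as simple as the following sketch, which computes the mean, variance, and standard deviation of flux values per site; the dictionary input stands in for data that would, in the framework, be read from the bound data resource service.

```python
# A minimal sketch of a statistics-type data-processing service. The input
# dictionary stands in for pollutant flux values that the framework would
# read from the data resource service; values here are purely illustrative.
import statistics

def flux_statistics(flux_by_site):
    """Compute summary statistics for each site's list of flux values."""
    result = {}
    for site, values in flux_by_site.items():
        result[site] = {
            "mean": statistics.mean(values),
            "variance": statistics.pvariance(values),
            "std_dev": statistics.pstdev(values),
        }
    return result

print(flux_statistics({"Yuxi River": [0.82, 0.77, 0.91]}))
```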
Data processing is a very broad concept, and any operation conducted on data can be called data processing. This paper focuses on three types of data-processing method: (a) Data extraction refers to extracting part of the specified data, such as extracting data of a research area of interest from DEM data. (b) Data mapping refers to data exchange operations between raw data and UDX data nodes; in other words, behind the data resource services mentioned above, it is the data-mapping method at work. (c) Data refactoring can be broadly understood as processing one form of input data into another form. For example, elevation data stored in GeoTIFF can be processed into ASCII GRID files using a data-refactoring method, and the statistical processing of pollutant data mentioned above can also be considered a data-refactoring operation.
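As a concrete instance of a data-refactoring method, the sketch below converts elevation data stored in GeoTIFF into an ASCII GRID file using the GDAL Python bindings (assumed to be installed); the file names are placeholders.

```python
# A minimal sketch of a data-refactoring method: converting a GeoTIFF raster
# into the Arc/Info ASCII GRID format with the GDAL Python bindings
# (assumed to be installed). File names are placeholders.
from osgeo import gdal

def geotiff_to_ascii_grid(src_path, dst_path):
    """Translate a GeoTIFF raster into an ASCII GRID file."""
    dataset = gdal.Translate(dst_path, src_path, format="AAIGrid")
    if dataset is None:
        raise RuntimeError(f"Failed to convert {src_path}")
    return dst_path

# Example call (placeholder paths):
# geotiff_to_ascii_grid("dem.tif", "dem.asc")
```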
Different data-processing methods may be implemented in different ways, mainly in the programming language they use (such as Python, Java, C#, etc.), the way they are called (such as Python scripts, DLL, exe, jar, etc.), and the running environment (such as the Python environment, .NET Framework, JDK, etc.). Since programs written in most programming languages can be invoked from the command line, this paper describes the invocation interface of data-processing methods as a command-line-based invocation mode, and passes the path of the data to be processed to the method in the form of command-line parameters, so as to realize data processing.
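The command-line invocation mode can be sketched as follows: the container starts the processing method in a child process and passes the path of the data to be processed as command-line parameters. The script name and argument names are placeholders, not an interface defined by the framework.

```python
# A minimal sketch of the command-line invocation mode: the container passes
# the path of the data to be processed to a processing method as command-line
# parameters and waits for the exit code. Script and argument names are
# placeholders for illustration only.
import subprocess
import sys

def invoke_processing_method(script_path, input_path, output_dir):
    """Run a data-processing script in a child process via the command line."""
    completed = subprocess.run(
        [sys.executable, script_path, "--input", input_path, "--output", output_dir],
        capture_output=True, text=True, timeout=3600,
    )
    if completed.returncode != 0:
        raise RuntimeError(f"Processing failed: {completed.stderr}")
    return completed.stdout

# Example call (placeholder paths):
# invoke_processing_method("clip_dem.py", "dem.asc", "./output")
```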
Generally, UDX works better for complex custom data formats, which lack common data read and write libraries. These custom data formats can be mapped to UDX data by data-mapping methods, and subsequent data processing based on UDX data spares users from manipulating the raw data directly. In this case, the data-processing method needs to support UDX reading and writing, so this part of the data-processing method needs to use the UDX API to encapsulate its data read and write interface. In this way, the published data-processing services and data resource services can cooperate with each other, so as to meet the diversified data requirements of users.
3.1.3. Access and Invocation of Data Service
After the data service is generated, the following questions need to be considered: How can users access and invoke the data service? How are data services executed? How does the container handle exceptions during data service execution? To address these issues, remote access to data services, asynchronous execution, and runtime-monitoring methods are designed, as shown in Figure 4.
- (1) Remote access
The sharing of data resources often involves multiple departments; for example, data centers usually have multiple subordinate units. How to coordinate data sharing between subordinate and leading departments, as well as among subordinate departments, is therefore an urgent problem to be solved. The data service container designed in this paper supports deployment on multiple different service nodes, and data services in different nodes can be migrated to each other. For example, suppose container node A holds a large-volume data resource service, but the related data-processing services are deployed in container node B. In general, migrating data-processing methods is much more efficient than migrating data resources, so the data-processing service can be migrated from container node B to node A, and users can then invoke the data-processing service in container A to obtain the data they want.
- (2) Asynchronous execution
Usually, the invocation of a data service is a time-consuming operation: whether the user directly requests a data resource service or invokes a data-processing service, a certain execution time is needed. Therefore, the invocation of a data service is implemented asynchronously. When the container receives the user’s invocation request, the container process prepares for the execution of the data service based on the user’s input (e.g., input data files, input parameters, etc.), such as initializing the working directory, specifying the output directory, and determining the parameter-passing order. The container process then creates a separate execution process in which the data service runs. At this point, the container process can continue to process other invocation requests without blocking, thus increasing the concurrency of data service invocations. When the data service is completed, the execution process notifies the container process and passes the results of the data service execution to it, and the container process then forwards the output to the user. Since the data service runs in an independent process, even if an exception occurs during its operation, it will not affect the container process, thus ensuring the stability of the data service container.
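The following sketch illustrates this asynchronous pattern in Python: the container-side call prepares a working directory, spawns the data service in a separate process, and returns immediately, while results come back through a queue. The function names and the placeholder service body are assumptions for illustration.

```python
# A minimal sketch of asynchronous invocation: the container process prepares
# a working directory, starts the data service in a separate process, and
# immediately returns to handle further requests. The process boundary keeps
# a crashing service from affecting the container. Names are illustrative.
import multiprocessing as mp
import tempfile

def run_service(service_id, inputs, result_queue):
    """Executes one data service invocation in its own process."""
    try:
        output = f"result of {service_id} on {inputs}"   # placeholder work
        result_queue.put(("finished", service_id, output))
    except Exception as exc:                             # exceptions stay in this process
        result_queue.put(("error", service_id, str(exc)))

def invoke_async(service_id, inputs, result_queue):
    """Container-side call: prepare the working directory and spawn the worker."""
    workdir = tempfile.mkdtemp(prefix=f"{service_id}_")
    worker = mp.Process(target=run_service, args=(service_id, inputs, result_queue))
    worker.start()                                       # non-blocking for the container
    return worker, workdir

if __name__ == "__main__":
    queue = mp.Queue()
    worker, _ = invoke_async("flux_statistics", {"site": "Yuxi River"}, queue)
    print(queue.get())                                   # container forwards this to the user
    worker.join()
```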
- (3) Service monitoring
In order to ensure the successful execution of data services, it is necessary to monitor the execution process of data services. The runtime monitoring of data services is the responsibility of the monitoring process. For the stability of the data service container, the monitoring process is also an independent process created by the container process, which is specifically responsible for the monitoring and exception handling of the execution process of the data service.
After the data service is started, the monitoring process first checks the input of the data service, such as whether the input data organization meets the requirements and whether the input control parameters contain invalid values. The monitoring process then continues to watch for exceptions during the execution of the data service, such as data I/O exceptions (e.g., errors in reading and writing data), timeout exceptions (e.g., long waits caused by processing large data sets), and runtime exceptions (e.g., manipulating a wrong memory address in the code). In addition, the monitoring process tracks the execution status of the data service in real time, such as the input data status, running status, output status, and error status, and feeds this execution information back to the container process. When the monitoring process detects that the data service is abnormal, it immediately terminates the data service execution process and reports the exception information to the container process. Finally, the temporary files generated during the execution of the data service and the system resources it occupied are cleared and released to ensure the efficiency of the container operation.
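A minimal monitoring routine along these lines is sketched below: it validates the input, runs the execution process with a timeout, records the failure cause, and cleans up the temporary working directory. The specific checks and the timeout limit are illustrative assumptions, not the framework's actual rules.

```python
# A minimal sketch of runtime monitoring: validate the input, run the
# execution process with a timeout, terminate it on failure, and clean up
# temporary files. Checks and limits are illustrative assumptions.
import os
import shutil
import subprocess
import sys

def monitor_execution(script_path, input_path, workdir, timeout_s=600):
    """Validate inputs, run the service process, and handle exceptions."""
    if not os.path.exists(input_path):                    # input check
        return {"status": "error", "detail": "input data not found"}
    try:
        proc = subprocess.Popen([sys.executable, script_path, input_path],
                                cwd=workdir, stderr=subprocess.PIPE, text=True)
        _, stderr = proc.communicate(timeout=timeout_s)   # wait with timeout
        status = "finished" if proc.returncode == 0 else "error"
        detail = stderr if status == "error" else ""
    except subprocess.TimeoutExpired:                     # timeout exception
        proc.kill()
        status, detail = "error", "execution timed out"
    finally:
        shutil.rmtree(workdir, ignore_errors=True)        # release temporary files
    return {"status": status, "detail": detail}
```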
3.2. Design of Workspace
In general, the data service container provides data services for users in the form of single services. In other words, users can only call one data service (data resource service or data-processing service) in the data service container at a time. When users have complex data requirements, they need to manually invoke data services several times, which is very inconvenient. Therefore, a place where users can access and configure data is needed; this is the workspace proposed in this paper. In the workspace, users can integrate the required data resources and data-processing methods, and then customize their data requirements based on these data services.
Figure 5 shows the design of the workspace.
3.2.1. Configuration of Data-Processing Workflow
When there are many data services distributed in the network space, how to find and use them in time is a major concern of data users. Therefore, a registration mechanism for data services is proposed in this paper. After a data provider deploys a data service container, he/she can decide whether the container provides data services on an intranet or in an open network environment. When a data service container is authorized to be shared in an open network, it is registered in the workspace, and any data service information updated in this container in the future is synchronized to the workspace. In this way, the data service resources in the open network environment are recorded in the workspace in real time.
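The registration step could, for instance, be an HTTP call from the container to the workspace, as in the hypothetical sketch below; the endpoint URL and payload fields are assumptions made only to illustrate the idea.

```python
# A hypothetical sketch of container registration with the workspace. The
# endpoint URL and payload fields are assumptions for illustration; only the
# requests library call itself is standard.
import requests

def register_container(workspace_url, container_host, services):
    """Register a data service container and its services with the workspace."""
    payload = {
        "host": container_host,          # address of the data service container
        "open_network": True,            # shared in the open network, not intranet only
        "services": services,            # e.g., [{"id": "...", "type": "resource"}]
    }
    response = requests.post(f"{workspace_url}/containers/register",
                             json=payload, timeout=10)
    response.raise_for_status()
    return response.json()
```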
When users need to customize their data requirements, they create a workspace instance in the workspace. A workspace instance contains all the resources and tools needed to customize the current data requirements, including data resources, data-processing methods, and data configuration tools (such as the workflow canvas, the service-binding tool, etc.). Then, according to the requirements of data processing, users can express the data-processing logic visually on the workflow canvas. For example, when users need to obtain DEM data of a target research area, they need to perform image mosaic, reprojection, and clipping operations on the original DEM data (which have been published as data services). The user then drags and drops three workflow nodes onto the workflow canvas to represent these three data-processing operations, and uses arrows to connect the three nodes in sequence. At this point, the workflow for DEM data extraction in the target research area is constructed. Moreover, when there are special data requirements, for example, when missing values in the DEM data need to be filled before clipping, the user can still drag and drop a node onto the workflow canvas to represent the data-filling operation, break the arrow between the “reprojection” and “clipping” nodes, and reconnect the “reprojection”, “data filling”, and “clipping” nodes. Thus, a data-processing workflow that meets the specific requirements is configured.
Each node in the workflow represents an abstract data-processing process that is not capable of execution. Users can customize any data-processing nodes and connect them according to the sequence of data-processing logic to form an abstract data-processing logic workflow. Finally, a data-processing workflow that meets the user’s data customization needs is built. However, the workflow cannot perform real data-processing operations until the corresponding data service is bound to each node through the service-binding tool.
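The abstract workflow can be pictured as a small data structure of nodes and links, as in the following sketch, which rebuilds the DEM example above and reports that the workflow is not executable until services are bound; the class and attribute names are illustrative.

```python
# A minimal sketch of the abstract workflow: nodes stand for data-processing
# steps, links record their order, and a node cannot run until a concrete
# data service is bound to it. Class names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class WorkflowNode:
    name: str                         # e.g., "mosaic", "reprojection", "clipping"
    bound_service: str = None         # id of the bound data service, if any

@dataclass
class Workflow:
    nodes: dict = field(default_factory=dict)
    links: list = field(default_factory=list)   # (from_node, to_node) pairs

    def add_node(self, name):
        self.nodes[name] = WorkflowNode(name)

    def connect(self, src, dst):
        self.links.append((src, dst))

    def is_executable(self):
        """True only when every node has a data service bound to it."""
        return all(n.bound_service for n in self.nodes.values())

# DEM extraction example from the text: mosaic -> reprojection -> clipping,
# extended with a "data filling" node before clipping.
wf = Workflow()
for step in ("mosaic", "reprojection", "data filling", "clipping"):
    wf.add_node(step)
wf.connect("mosaic", "reprojection")
wf.connect("reprojection", "data filling")
wf.connect("data filling", "clipping")
print(wf.is_executable())   # False until services are bound to the nodes
```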
3.2.2. Generation of Data-Computing Solutions
When the data-processing workflow is configured, it can only represent the data-processing logic, and cannot really carry out data processing. At this point, it is necessary to bind data-processing services and data resource services to each logical node in the workflow. Since the workspace instance records all data services in the open network space, users can access and select the relevant data services directly and bind them to the corresponding workflow node. However, in practice, the existing data services in the network may not all meet the needs of users to configure data. Users can only associate and bind those data services that meet the requirements to the data-processing node, and for those missing data services, users can deploy their own data service container to publish the relevant data service for the corresponding data-processing nodes. For example, when users need to perform grid segmentation for a certain research area, if there is no corresponding data service in the network, users need to encapsulate the grid-partitioning algorithm (based on open-source or self-implemented code) and publish these algorithms in their own data service container as data-processing services. Then, when the container is registered with the workspace, users can access these grid-partitioning algorithm services in the workspace instance.
When the data service binding is complete, a data-computing solution that can actually run is formed.
Figure 6 shows an example of an XML expression of a data-computing solution.
An executable computing solution consists of three main parts:

(1) The “DataCollection” node refers to the collection of data resource services required in the current data-processing workflow. It consists of several “Data” nodes, each of which represents a specific data resource service. The “id” attribute is used to uniquely identify the data resource service, and the “source” attribute indicates the network host address of the data service container where the current data resource service resides. Thus, a data resource service can be uniquely located through the “id” and “source” attributes.

(2) The “MethodCollection” node refers to the collection of data-processing services required in the current data-processing workflow. It consists of several “Method” nodes, and each “Method” node consists of “InputCollection”, “ControlParams”, and “OutputCollection” nodes. The “InputCollection” node consists of several “Input” nodes; each “Input” node represents an input to the current data-processing service, and its “id” attribute corresponds to the “id” attribute of a “Data” node. The “ControlParams” node consists of several “Param” nodes; each “Param” node represents a control parameter of the current data-processing service, where the “type” and “value” attributes indicate the type and value of the control parameter, respectively. The “OutputCollection” node consists of several “Output” nodes; each “Output” node represents an output of the current data-processing method, its “id” attribute is used to uniquely identify the output, and the “target” attribute indicates the server where the output is stored.

(3) The “LinkCollection” node records the data-processing logic of the current data-processing workflow and is composed of several “Link” nodes. Each “Link” node records the connection between two nodes in the data-processing workflow: the “from” attribute indicates the “id” of the start node, and the “to” attribute indicates the “id” of the end node. Thus, when the data-computing solution is executed, the execution order of the data-processing services can be determined by traversing the “LinkCollection” node.
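To illustrate how such a document can be used, the sketch below parses a made-up solution fragment that follows the node and attribute names described above (it is not the actual content of Figure 6) and derives the execution order of a simple linear workflow by chaining the “Link” nodes.

```python
# A minimal sketch of traversing a data-computing solution document. The XML
# fragment follows the node and attribute names described in the text but is
# a made-up example, not the paper's Figure 6; the chaining assumes a simple
# linear workflow without branches.
import xml.etree.ElementTree as ET

SOLUTION_XML = """
<Solution>
  <DataCollection>
    <Data id="dem" source="http://nodeA.example/container"/>
  </DataCollection>
  <MethodCollection>
    <Method id="mosaic"><InputCollection><Input id="dem"/></InputCollection></Method>
    <Method id="reproject"><ControlParams><Param type="string" value="EPSG:4326"/></ControlParams></Method>
    <Method id="clip"><OutputCollection><Output id="dem_clip" target="http://nodeB.example/container"/></OutputCollection></Method>
  </MethodCollection>
  <LinkCollection>
    <Link from="reproject" to="clip"/>
    <Link from="mosaic" to="reproject"/>
  </LinkCollection>
</Solution>
"""

def execution_order(xml_text):
    """Chain the Link nodes of a linear workflow into an execution order."""
    root = ET.fromstring(xml_text)
    links = [(link.get("from"), link.get("to")) for link in root.iter("Link")]
    successor = dict(links)
    # The start node is the one that no link points to (single chain assumed).
    start_nodes = set(successor) - {to for _, to in links}
    order = []
    for node in start_nodes:
        while node is not None:
            order.append(node)
            node = successor.get(node)
    return order

print(execution_order(SOLUTION_XML))   # ['mosaic', 'reproject', 'clip']
```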
At this point, an executable data-computing solution is generated. A computing solution is used to provide data resources for specific data requirements, and users can save the computing solution to cope with changes in data requirements. At the same time, the computing solution can be easily shared with other users.
3.3. Design of Data-Computing Engine
The data-computing engine is designed for the execution of data-computing solutions, and its main goal is to ensure their safe and stable execution. Therefore, the core work of the data-computing engine is to accurately analyze the running-environment requirements before a computing solution is executed, and to monitor and handle exceptions in real time while the computing solution is running.
3.3.1. Generation of Data-Computing Task
To execute a data-computing solution, it is first necessary to understand the computing environment requirements for these data-processing tasks, that is, to create data-computing tasks. When the user calls the execution command of the data-computing solution in the workspace, the workspace will transfer the computing solution to the computing engine.
Firstly, the computing engine traverses all the nodes of the data-computing solution and collects the metadata of the data service corresponding to each node. The metadata of a data resource service usually include the size of the disk space occupied by the data resource, which can be used as a basis for allocating hard disk space for it. The metadata of a data-processing service usually include the hardware environment (such as the CPU architecture, memory size, etc.) and software environment (such as a required Python 3.7 installation, etc.) that the execution of the data-processing program depends on. Based on this software and hardware dependency information, the computing engine searches for suitable server nodes in the network (these nodes have already been registered in the computing engine). Server node matching is a complicated process that needs a specially designed matching algorithm. The matching algorithm scores each server node based on the software and hardware requirements of the current computing task and provides a list of candidate servers sorted by their scores. If a suitable server cannot be found, the user must manually configure the corresponding execution environment.
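A very simple version of such a matching algorithm is sketched below: missing software disqualifies a server, satisfied hardware requirements add to its score, and candidates are returned sorted by score. The requirement and server record structures, as well as the weights, are assumptions made only for illustration.

```python
# A minimal sketch of server-node matching: each registered server is scored
# against the software and hardware requirements of a computing task, and
# candidates are returned sorted by score. Record structures and weights are
# illustrative assumptions, not the framework's actual matching algorithm.
def score_server(server, requirements):
    """Missing software disqualifies; satisfied hardware needs add points."""
    score = 0
    for package in requirements.get("software", []):          # e.g., "python3.7"
        if package not in server.get("software", []):
            return 0                                           # disqualified
        score += 10
    if server.get("memory_gb", 0) >= requirements.get("memory_gb", 0):
        score += 5
    if server.get("disk_gb", 0) >= requirements.get("disk_gb", 0):
        score += 5
    return score

def candidate_servers(servers, requirements):
    scored = [(score_server(s, requirements), s["host"]) for s in servers]
    return sorted([entry for entry in scored if entry[0] > 0], reverse=True)

servers = [
    {"host": "nodeA", "software": ["python3.7"], "memory_gb": 16, "disk_gb": 500},
    {"host": "nodeB", "software": [], "memory_gb": 32, "disk_gb": 1000},
]
print(candidate_servers(servers, {"software": ["python3.7"], "memory_gb": 8, "disk_gb": 100}))
```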
Subsequently, each data-processing service will generate a data-computing task, as shown in Figure 7. The “serviceId” attribute of each “Task” node represents the data-processing service to be called by the task, and the “order” attribute represents the execution order of the task; for example, a value of “1” for “order” indicates that the task is executed first, and so on. The “Servers” node indicates the server where the current task is to run. There may be multiple “Server” nodes under the “Servers” node (that is, multiple servers meet the execution conditions of the current task). Each “Server” node has a “score” attribute that indicates how well the server node meets the computing-environment requirements of the current task; typically, the computing task will select a server node with a high score as its execution environment. The “Dependency” node indicates the software and hardware requirements of the current computing task: each “Environment” node under the “Software” node indicates a required software environment, and the “CPU”, “Memory”, “HardDisk”, and “Network” nodes under the “Hardware” node indicate the required hardware.
3.3.2. Execution of Data-Computing Tasks
After all computing tasks are generated, they can be prepared for execution, as shown in Figure 8. All computing tasks are stored in the task queue in order of priority. Computing tasks with the same priority can be executed at the same time; otherwise, the tasks with higher priority are executed first. When a task obtains the execution right, the resource files required by the computing task (such as data resources, data-processing method resources, etc.) are first synchronized from the data service container to the specified computing server. Because multiple tasks may run simultaneously on a computing server, insufficient system resources can cause the current task to fail. Therefore, before each task is executed, it is necessary to determine whether the current server resources can support its execution. If so, the computing task can be started; otherwise, the task continues to wait for an appropriate execution time.
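The scheduling rule can be sketched as follows: tasks are grouped by their priority, tasks sharing a priority may start together, and each task waits until the server's free memory and disk (checked here with psutil, which is assumed to be available) cover its declared needs. The task record fields and the polling interval are illustrative assumptions.

```python
# A minimal sketch of the scheduling rule: tasks are grouped by priority
# ("order" value), tasks sharing a priority may run together, and a task
# starts only when free memory and disk cover its declared needs (checked
# with psutil, assumed installed). Record fields are illustrative.
import itertools
import time
import psutil

def resources_sufficient(task):
    """Check whether free memory and disk cover the task's declared needs."""
    mem_ok = psutil.virtual_memory().available >= task["memory_bytes"]
    disk_ok = psutil.disk_usage("/").free >= task["disk_bytes"]
    return mem_ok and disk_ok

def run_in_priority_order(tasks, start_task):
    """Start tasks batch by batch; tasks sharing an 'order' value may run together."""
    ordered = sorted(tasks, key=lambda t: t["order"])
    for _, batch in itertools.groupby(ordered, key=lambda t: t["order"]):
        for task in batch:
            while not resources_sufficient(task):
                time.sleep(5)                 # wait for an appropriate execution time
            start_task(task)                  # e.g., spawn the execution process
```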
In order to avoid the interference between different computing tasks, this paper proposes a virtual computing container to manage the resources involved in the running of each computing task, including hardware resources such as CPU, memory, hard disk space, and related software resources such as Java runtime. The container process will monitor these software and hardware environments to ensure the correct execution of computing tasks. It also reports to the computing engine in real time when system resources are insufficient. In addition, the container process also manages the executing process and the monitoring process, with the former responsible for running computing tasks and the latter responsible for monitoring task execution.
When a data-computing task is started, its entire execution period is monitored by the monitoring process. The monitoring process determines the running status of the computing task by obtaining its status information (e.g., running, exception, finished, etc.). When a computing task encounters an exception, the monitoring process will check the running logs to determine the cause of the exception. For example, if the exception is due to insufficient system resources (such as memory, hard disk size), the monitoring process will re-run the task at the appropriate time. If the exception is caused by illegal input of a computing task, the monitoring process will directly report the exception to the user. In this case, all computing tasks that are executed after this computing task are suspended until the user reconfigures correct data for this computing task.