CN116860463A

CN116860463A - Distributed self-adaptive spaceborne middleware system

Info

Publication number: CN116860463A
Application number: CN202311134416.8A
Authority: CN
Inventors: 崔姝瑶; 汤昭荣; 唐晓瑜; 邱吉冰
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2023-09-05
Filing date: 2023-09-05
Publication date: 2023-10-10

Abstract

The application discloses a distributed self-adaptive spaceborne middleware system that improves the stability and fault tolerance of a spaceborne system. By establishing a multi-dimensional fault tolerance mechanism, the system can cope with various faults and abnormal conditions in the spaceborne system. Its two modes, functional-level fault tolerance and task-level fault tolerance, handle and repair different fault conditions respectively. The system also adopts a distributed architecture that integrates the computing resources of a plurality of nodes and balances the task load through a self-adaptive algorithm, thereby improving the utilization rate and performance of the system resources.

Description

Distributed self-adaptive spaceborne middleware system

Technical Field

The application belongs to the field of spaceborne computers, and particularly relates to a distributed self-adaptive spaceborne middleware system.

Background

The spaceborne computer is the core calculation and control component of the electronic system on a satellite and is mainly responsible for operation control, data processing, task planning and other matters during the in-orbit work of the satellite. Its daily tasks include monitoring the running state of the satellite, processing and transmitting remote sensing image data acquired by sensors, and calculating satellite orbits. As the core component for satellite information processing and data interaction, the spaceborne computer must be stable, because its stability directly influences the performance of the whole spaceborne system.

Because of the special environment in which the spaceborne computer operates, it can suffer electromagnetic interference, space particle radiation, impact and the like, and these disturbances can interrupt the spaceborne computer while it is performing tasks. In addition, high-energy ionizing radiation can transiently corrupt the information held in the memory or logic elements of the spaceborne computer: when ionizing radiation particles penetrate the silicon substrate of a transistor, they generate electron-hole pairs. The diffusion and drift of these electron-hole pairs cause charge to accumulate in the storage element and eventually cause a fault in which the logic value of the device is flipped; that is, a bit whose original value is 0 becomes 1 after being struck, or a bit whose value is 1 becomes 0. This phenomenon, which causes errors in stored information, is commonly referred to as a single event upset (SEU).

Meanwhile, because the distribution of monitoring targets is uncertain and dynamic, the computing tasks and the computing resources of the spaceborne computers may be mismatched, which wastes a large amount of computing resources. Because the current system architecture of spaceborne clusters does not support resource sharing and dynamic allocation, computing resources that could meet task demands cannot be fully used even when they exist on the spaceborne computers, which poses a great challenge to the high availability of spaceborne computer systems.

Disclosure of Invention

Based on this, the application provides a distributed adaptive on-board middleware system, which is an innovative solution based on cloud computing ideas and aims to improve the performance and reliability of an on-board computer system. The system adopts a distributed architecture, integrates the computing resources on a plurality of nodes, and can automatically adjust the number and the allocation strategy of the computing nodes according to the actual computing resources and task demands so as to achieve the optimal computing performance and the optimal resource utilization efficiency. Meanwhile, the system has high availability and fault tolerance, and can realize rapid fault switching and resource redistribution under the condition of node fault or insufficient computing resources, thereby ensuring the stable operation of the system.

The embodiment of the application provides a distributed self-adaptive spaceborne middleware system, which comprises a ground central server and a plurality of spaceborne computers, wherein the ground central server and the spaceborne computers are clustered;

the ground central server is a main node of the cluster and comprises a resource detection module, a task management module and a node management module, wherein the resource detection module is used for monitoring available resources of the satellite-borne computer in real time, the task management module is used for managing tasks, and the node management module is used for adding and deleting working nodes;

the space-borne computer is a working node of the cluster and is used for establishing a virtual node after receiving the task dispatched by the ground central server and giving corresponding resources to the virtual node so as to process the task.

Further, the system also comprises a data storage device, wherein the data storage device stores task processing results and sensor data for sensing the satellite running environment and monitoring the satellite state.

Further, the system also comprises a scheduling unit, wherein the scheduling unit comprises a global scheduler and a local scheduler; the global scheduler is arranged on the ground central server and is used for performing task scheduling of each working node; the local schedulers are arranged on all the satellite-borne computers and are responsible for the establishment and scheduling of the local virtual nodes.

Further, in the task allocation stage, the global scheduler compares the resources required by the task to be processed with the available resources of each working node, if a single working node meets the resources required by the current task, the task is directly allocated to the working node, and a virtual node is created by a local scheduler on the working node to execute the machine learning task; if a single working node cannot meet the resources required by the task to be processed, the task to be processed is divided into a plurality of subtasks, and the subtasks are distributed to different working nodes for processing according to the idle condition of the resources of the different working nodes.

Further, if there is a time sequence association between different subtasks, the global scheduler issues a subsequent subtask after obtaining the staged results of the assigned subtasks.

Further, the descriptions of all tasks to be processed are stored in a message queue. During task inference, a working node consumes a task from the message queue; after receiving the task, the working node acquires the task information according to the task description and performs inference, and after the inference succeeds it informs the queue that processing of the message is complete.

Further, if task processing by a working node fails or times out, the task is added to the message queue again to wait for reprocessing, which avoids the situation in which a task is interrupted and the failed task is never executed again.

Further, a backup queue is added to store the task description being executed, the task in the backup queue is deleted after the task execution is finished, and if the task execution fails or the service is overtime, the task is put back to the tail of the task queue.

Further, the working node performs error correction through a voter technique.

Further, the working node copies the task to be executed twice to obtain three identical tasks, executes the three tasks separately on identical input data, and inputs the execution results into a voter for comparison. The voter compares the outputs of the tasks bit by bit; if the outputs of the three tasks are consistent, the voter outputs the corresponding result; if one output is inconsistent with the other two, the voter corrects the erroneous bits and outputs the correct result.

The technical scheme provided by the embodiment of the application can comprise the following beneficial effects:

According to the embodiment, distributed computing and resource sharing are realized through the cooperation of the ground central server and the cluster of spaceborne computers, and the system has the advantages of high availability, fault tolerance, expandability and adaptivity. The system can be applied in fields such as satellite communication, navigation and remote sensing, where it improves the running efficiency and stability of the spaceborne computer system, reduces the failure rate and the maintenance cost, and provides more stable and efficient support for spaceborne computer applications and for the development of satellite applications.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

FIG. 1 is a schematic diagram of a distributed adaptive on-board middleware system architecture, shown according to an example embodiment.

FIG. 2 is a schematic diagram illustrating a task distribution flow according to an example embodiment.

FIG. 3 is a flow chart illustrating dynamic allocation of tasks according to an exemplary embodiment.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

The application provides a high-availability self-adaptive satellite-borne middleware system which is mainly suitable for improving the cluster performance of a satellite-borne computer in a satellite and a space station, and aims to solve the following technical problems:

aiming at the problem of space-borne resource waste caused by unbalance of computing resources, the application provides a system-level solution, provides an extensible modular architecture, monitors the service condition of the resources in real time, realizes self-adaptive dynamic allocation of the computing resources, and can autonomously add and delete computing nodes so as to adapt to different computing demands and task scales and fully utilize the space-borne resources.

Aiming at the problem that the result of the final target identification program is incorrect because of stored-information errors caused by SEU, the application provides a method based on triple modular redundancy, which provides a plurality of copies of a circuit that is susceptible to SEU and obtains the correct output by comparing them.

Aiming at the problem that a task fails or stops because of interference suffered by the spaceborne computer, the application provides a method that uses a single-threaded message queue to recover and reprocess tasks whose processing has failed.

FIG. 1 is a schematic diagram of a distributed adaptive spaceborne middleware system architecture according to an example embodiment. As shown in FIG. 1, the system may comprise a ground central server and a plurality of spaceborne computers, which are clustered; the ground central server is the main node of the cluster and comprises a resource detection module, a task management module and a node management module, wherein the resource detection module is used for monitoring the available resources of the spaceborne computers in real time, the task management module is used for managing tasks, including adding and deleting tasks to be processed, and the node management module is used for adding and deleting working nodes; each spaceborne computer is a working node of the cluster and is used for establishing a virtual node after receiving a task dispatched by the ground central server and giving the corresponding resources to the virtual node so as to process the task.

Specifically, the system can also comprise a data storage device, wherein the data storage device stores the data information acquired by the cameras and various sensors, and the data information is used for sensing the satellite running environment and monitoring the satellite state. Meanwhile, the task processing results are also stored in the data storage device. In the system, a ground central server and a satellite-borne computer are taken as cores and combined with data storage equipment to form an organic whole. The central server is used as a main node of the cluster and is responsible for controlling, distributing tasks, monitoring states and the like of the whole cluster. The space-borne computer is used as a working node of the cluster and is responsible for running tasks distributed by the main node.

In a specific implementation, the ground central server serves as the main node of the system and also monitors and manages the whole system. It can monitor the states of all nodes in the cluster, including hardware, software and network aspects, so as to discover and handle abnormal conditions in time. The spaceborne computer serves as a working node of the system and is responsible for executing specific computing tasks. In the system, each working node can execute tasks independently, and can also cooperate with other working nodes to complete more complex computing tasks. When a working node fails or the execution of a task fails, the system can automatically switch to another working node to execute the task, so that high availability and fault tolerance are realized.

In particular implementations, each of the on-board computers has its own local resources, including CPU, GPU, FPGA, disk, network, etc., on which to rely to run the assigned tasks.

When a task needs to be processed, the master node classifies the task, including different types such as computationally intensive, I/O intensive, communication intensive and the like, and then adaptively distributes the task to available working nodes for processing according to the requirements of the different types of tasks on computational resources, storage resources and communication resources.
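The classification step described above can be illustrated with a minimal sketch. The application does not specify how a task is assigned to a category; the `TaskProfile` fields and the "dominant demand" rule below are assumptions made only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class TaskType(Enum):
    COMPUTE_INTENSIVE = "compute"
    IO_INTENSIVE = "io"
    COMMUNICATION_INTENSIVE = "communication"


@dataclass(frozen=True)
class TaskProfile:
    # Hypothetical normalized demand estimates in [0, 1]; not taken from the application.
    compute: float
    storage_io: float
    communication: float


def classify(profile: TaskProfile) -> TaskType:
    """Label a task by its dominant resource demand (an assumed rule)."""
    demands = {
        TaskType.COMPUTE_INTENSIVE: profile.compute,
        TaskType.IO_INTENSIVE: profile.storage_io,
        TaskType.COMMUNICATION_INTENSIVE: profile.communication,
    }
    # max() returns the first key with the largest value, so ties favor compute.
    return max(demands, key=demands.get)
```

A scheduler could then match, for example, a compute-intensive task to the working node with the most idle CPU or GPU capacity.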

Specifically, the system may further include a scheduling unit including a global scheduler and a local scheduler; the global scheduler is arranged on the ground central server and is used for ensuring that the resources obtained by each working node are reasonable and balanced; and each satellite-borne computer is provided with a local scheduler which is responsible for the establishment and scheduling of virtual nodes and cooperates with the virtual nodes to complete the scheduling of tasks.

In a specific implementation, when the system processes tasks, a virtualization technology is adopted to run in a mode of a virtual node, and part or all of computing resources (CPU, GPU and the like), storage resources and communication resources of a physical node are integrated into the virtual node for use by the corresponding tasks. When the computing resource of a node is insufficient to process a complex task, the system divides the task into a plurality of subtasks, and distributes the subtasks to a plurality of virtual nodes to execute the task, so that the task is completed on time.
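As an illustration of how part of a physical node's resources might be granted to a virtual node, the following minimal sketch models the bookkeeping only. The class names, the three resource fields and the release step are hypothetical; the application does not name a particular virtualization technology or data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resources:
    cpu_cores: int
    gpu_count: int
    storage_gb: int

    def covers(self, need: "Resources") -> bool:
        return (self.cpu_cores >= need.cpu_cores
                and self.gpu_count >= need.gpu_count
                and self.storage_gb >= need.storage_gb)

    def minus(self, other: "Resources") -> "Resources":
        return Resources(self.cpu_cores - other.cpu_cores,
                         self.gpu_count - other.gpu_count,
                         self.storage_gb - other.storage_gb)

    def plus(self, other: "Resources") -> "Resources":
        return Resources(self.cpu_cores + other.cpu_cores,
                         self.gpu_count + other.gpu_count,
                         self.storage_gb + other.storage_gb)


@dataclass
class VirtualNode:
    task_id: str
    granted: Resources


@dataclass
class PhysicalNode:
    name: str
    free: Resources
    virtual_nodes: List[VirtualNode] = field(default_factory=list)

    def create_virtual_node(self, task_id: str, need: Resources) -> Optional[VirtualNode]:
        """Grant part (or all) of the free resources to a new virtual node."""
        if not self.free.covers(need):
            return None  # this node alone cannot host the task
        self.free = self.free.minus(need)
        vnode = VirtualNode(task_id, need)
        self.virtual_nodes.append(vnode)
        return vnode

    def release(self, vnode: VirtualNode) -> None:
        """Return the granted resources once the task has finished (assumed step)."""
        self.virtual_nodes.remove(vnode)
        self.free = self.free.plus(vnode.granted)
```

When `create_virtual_node` returns `None`, the task would have to be split into subtasks and spread over several nodes, as described above.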

Prior to task allocation, the system needs to monitor resource conditions. The system resource monitoring module is responsible for monitoring the use condition of system resources in real time, including the utilization rate of computing resources, storage resources and communication resources, and the like, and feeding back the real-time monitoring result to the task scheduling unit.

In practical application, the influence of network delay, data transmission and other factors on task scheduling needs to be considered. For example, assigning subtasks to nodes that are farther away may increase network latency, thereby affecting the efficiency of task operation. Therefore, the task scheduling policy needs to comprehensively consider various factors to realize an optimal task scheduling scheme.

In summary, the task scheduling module is a key component of the whole system. It adaptively allocates tasks to available working nodes for processing according to task types and resource requirements, and ensures the resource utilization rate and performance of the system through the cooperation of the global scheduler and the local schedulers.

In the task allocation stage, the global scheduler compares the resources required by the task with the available resources of each working node. If a single working node can meet the resources required by the current task, the task is allocated directly to that working node, and the local scheduler on the working node then creates a virtual node to execute the machine learning task. If the task to be processed is complex and no single working node can meet the required resources, the complex task is divided into a plurality of subtasks, which are distributed to different working nodes for processing according to how idle the resources of those working nodes are. If there is a time sequence association between different subtasks, the global scheduler issues the subsequent subtasks only after obtaining the staged results of the subtasks already assigned. The flow of task allocation is shown in FIG. 2.
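The allocation decision can be sketched as follows. This is a simplified illustration, not the scheduler of the application: resources are reduced to a single abstract number of units, and the tie-break (prefer the most idle node) and the greedy split are assumptions. The function name `allocate` and the node names are hypothetical.

```python
from typing import Dict


def allocate(task_demand: int, free_units: Dict[str, int]) -> Dict[str, int]:
    """Return {node name: units assigned} for one task.

    Raises RuntimeError if the whole cluster cannot cover the demand.
    """
    # Case 1: a single working node can satisfy the whole task.
    candidates = [name for name, free in free_units.items() if free >= task_demand]
    if candidates:
        # Assumed tie-break: choose the node with the most idle resources.
        best = max(candidates, key=lambda name: free_units[name])
        return {best: task_demand}

    # Case 2: no single node suffices, so split the task into subtasks.
    if sum(free_units.values()) < task_demand:
        raise RuntimeError("insufficient cluster resources for this task")

    plan: Dict[str, int] = {}
    remaining = task_demand
    # Visit nodes from most idle to least idle and hand out subtask shares.
    for name in sorted(free_units, key=free_units.get, reverse=True):
        if remaining == 0:
            break
        share = min(free_units[name], remaining)
        if share > 0:
            plan[name] = share
            remaining -= share
    return plan
```

For example, with free units `{"a": 2, "b": 6, "c": 5}`, a task needing 4 units goes entirely to node `b`, while a task needing 10 units is split into a 6-unit subtask on `b` and a 4-unit subtask on `c`.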

The system can effectively cope with various faults and abnormal conditions in the satellite-borne system by establishing a multi-dimensional fault tolerance mechanism. The two modes of functional level fault tolerance and task level fault tolerance can be respectively processed and repaired correspondingly according to different fault conditions.

In the task-level fault tolerance mode, the descriptions of the tasks to be processed in the system are stored in a message queue. During task inference, a working node consumes a task message from the message queue; after taking the task, the working node obtains the task information from the data storage device according to the path in the task description and performs inference, and after the inference succeeds it informs the queue that processing of the message is complete. If the message processing fails or times out, the task is re-queued to wait for reprocessing, which avoids the situation in which a task is interrupted and the failed task is never executed again.

Because the queue is single-threaded, the same message cannot be consumed by multiple working nodes at the same time. However, the case of unsuccessful consumption must still be considered. Once a message in the task queue has been sent, it is removed from the queue. If the consumer does not receive the message for network reasons, or if the consumer crashes while processing the message, the message cannot be restored. An acknowledgment (ack) mechanism is therefore added. The specific implementation is as follows: a backup queue is added to store the descriptions of the tasks being executed, and a task is deleted from the backup queue after its execution finishes. If the task fails to execute or the service times out, the task is put back at the tail of the task queue. This ensures that each task will eventually be consumed successfully. The flow of task execution is shown in FIG. 3.
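The backup-queue acknowledgment mechanism can be sketched in memory as follows. The application does not name a message queue product, so the class `ReliableTaskQueue`, its method names and the use of a caller-supplied logical time `now` (instead of a wall clock) are assumptions chosen to keep the example deterministic.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Optional


class ReliableTaskQueue:
    def __init__(self, tasks: Iterable[str], timeout: float) -> None:
        self.pending: Deque[str] = deque(tasks)  # task queue
        self.backup: Dict[str, float] = {}       # backup queue: task -> deadline
        self.timeout = timeout

    def consume(self, now: float) -> Optional[str]:
        """Hand out the next task and record it in the backup queue."""
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.backup[task] = now + self.timeout
        return task

    def ack(self, task: str) -> None:
        """Task finished successfully: delete it from the backup queue."""
        self.backup.pop(task, None)

    def fail(self, task: str) -> None:
        """Task failed: move it from the backup queue to the tail of the task queue."""
        if self.backup.pop(task, None) is not None:
            self.pending.append(task)

    def requeue_expired(self, now: float) -> List[str]:
        """Treat tasks whose deadline has passed as failed and re-queue them."""
        expired = [t for t, deadline in self.backup.items() if deadline <= now]
        for task in expired:
            self.fail(task)
        return expired
```

A task therefore always sits in exactly one place, the task queue or the backup queue, until it is acknowledged, so a crash of the consumer cannot silently lose it.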

For example, suppose the system performs a picture processing task. First, the descriptions of the tasks to be processed are put into a task queue; the system acquires the picture information stored in the data storage device and uses the picture name as the task name.

Next, to verify the reliability of the system during task processing, the following cases are set in accordance with actual conditions, where machines A and B are the initially running machines and machine C is the standby machine:

A) Machines A and B run smoothly and no errors occur: in this case, the two machines consume the picture processing tasks in turn until all tasks in the task list have been executed.

B) One machine goes down, and only one machine remains in normal operation: in this case, the crash may cause a task to fail; the task whose processing failed is put back into the list of tasks to be processed and is then processed again by the other machine.

C) Machines A and B both go down and the third machine C is started: in this case, the tasks that failed to execute because machines A and B went down are put back into the list of tasks to be processed, and machine C then processes both the tasks that failed and the tasks that had not yet started.
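The three cases above can be reproduced with a small simulation. This is only a sketch under stated assumptions: machines take tasks in round-robin order, a crash is modeled as a machine going down while holding its next task after a given number of completed tasks, and a standby machine is started only when no active machine remains. The function `run_cluster` and its parameters are hypothetical.

```python
from collections import deque
from typing import Dict, List


def run_cluster(tasks: List[str], active: List[str], standby: List[str],
                crash_after: Dict[str, int]) -> Dict[str, str]:
    """Return {task: machine that completed it}.

    crash_after[m] = n means machine m completes n tasks and then goes
    down while processing its next one.
    """
    pending = deque(tasks)
    done: Dict[str, str] = {}
    completed = {m: 0 for m in active + standby}
    alive = list(active)
    spare = list(standby)
    while pending:
        if not alive:
            if not spare:
                raise RuntimeError("no machine left to process tasks")
            alive.append(spare.pop(0))  # start a standby machine
        for machine in list(alive):
            if not pending:
                break
            task = pending.popleft()
            if completed[machine] == crash_after.get(machine, -1):
                # Machine goes down mid-task: the task returns to the tail.
                pending.append(task)
                alive.remove(machine)
            else:
                done[task] = machine
                completed[machine] += 1
    return done
```

In all three cases every task ends up in the result, which is the property the re-queue mechanism is meant to provide.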

In the functional-level fault tolerance mode, a voter is used. Voting is a common error correction technique that improves the reliability and fault tolerance of a circuit against the effects of SEU. In the voter technique, the original functional module is duplicated to obtain multiple copies of the same functional module, and the output results of these functional modules are then compared and corrected to obtain the final correct output. In the system, the working node copies the functional module to be hardened (i.e. the task to be executed) twice, so as to obtain three identical functional modules. The functional modules each perform the calculation on the same input data and input their results into the voter for comparison. In the comparison process, the voter compares the outputs of the functional modules bit by bit; if the outputs of the three functional modules are consistent, the voter outputs the corresponding result, and if one output is inconsistent with the other two, the erroneous bits are corrected to obtain the correct result. In this way, the accuracy of the output result can be ensured even if an SEU occurs in one of the copies.
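The bit-by-bit comparison performed by the voter corresponds to a bitwise majority function. The sketch below shows it in software for integers and for byte strings; the function names are hypothetical, and the application does not state whether its voter is implemented in hardware or software.

```python
from typing import Callable


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def vote_bytes(a: bytes, b: bytes, c: bytes) -> bytes:
    """Apply the bitwise majority vote to three equally long byte strings."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("the three outputs must have the same length")
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))


def run_with_tmr(task: Callable[[int], int], data: int) -> int:
    """Execute the same task three times on the same input and vote on the results."""
    return majority_vote(task(data), task(data), task(data))
```

For example, if the correct output is `0b10110010` and an upset flips bit 3 in one copy, voting over the three copies still returns `0b10110010`. The vote yields the correct value for every bit position in which at most one of the three copies is wrong.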

Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.

Claims

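1. A distributed self-adaptive spaceborne middleware system, characterized by comprising a ground central server and a plurality of spaceborne computers, wherein the ground central server and the spaceborne computers are clustered; the ground central server is a main node of the cluster and comprises a resource detection module, a task management module and a node management module, wherein the resource detection module is used for monitoring available resources of the spaceborne computers in real time, the task management module is used for managing tasks, and the node management module is used for adding and deleting working nodes; the spaceborne computer is a working node of the cluster and is used for establishing a virtual node after receiving the task dispatched by the ground central server and giving corresponding resources to the virtual node so as to process the task.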

2. The system of claim 1, further comprising a data storage device storing task processing results and sensor data for sensing the environment in which the satellite is operating and for satellite condition monitoring.

3. The system of claim 1, further comprising a scheduling unit comprising a global scheduler and a local scheduler; the global scheduler is arranged on the ground central server and is used for performing task scheduling of each working node; the local schedulers are arranged on all the satellite-borne computers and are responsible for the establishment and scheduling of the local virtual nodes.

4. A system according to claim 3, wherein in the task allocation phase, the global scheduler compares the resources required for the task to be processed with the available resources of each work node, and if a single work node meets the resources required for the current task, the task is directly allocated to the work node, and a local scheduler on the work node creates a virtual node to execute the machine learning task; if a single working node cannot meet the resources required by the task to be processed, the task to be processed is divided into a plurality of subtasks, and the subtasks are distributed to different working nodes for processing according to the idle condition of the resources of the different working nodes.

5. The system of claim 4, wherein if there is a time-sequential association between different sub-tasks, the global scheduler issues a subsequent sub-task after obtaining a staged result of the assigned sub-task.

6. The system of claim 1, wherein the descriptions of all tasks to be processed are stored in a message queue; during task inference, a working node consumes a task from the message queue; after receiving the task, the working node obtains the task information according to the task description and performs inference; and after the inference succeeds, the queue is informed that processing of the message is complete.

7. The system of claim 6, wherein if task processing by a working node fails or times out, the task is added to the message queue again to await reprocessing, thereby avoiding the situation in which a task is interrupted and the failed task is never executed again.

8. The system of claim 6 wherein a backup queue is added to store a description of the task being executed, the task in the backup queue is deleted after the task execution is completed, and if the task execution fails or the service times out, the task is placed back to the tail of the task queue.

9. The system of claim 1, wherein the working node performs error correction by a voter technique.

10. The system of claim 9, wherein the working node copies the task to be executed twice to obtain three identical tasks, executes the three tasks separately on identical input data, and inputs the execution results into the voter for comparison; the voter compares the outputs of the tasks bit by bit, and if the outputs of the three tasks are consistent, the voter outputs the corresponding result; if one output is inconsistent with the other two, the voter corrects the erroneous bits and outputs the correct result.