CN117041132B

CN117041132B - A distributed load balancing satellite routing method based on deep reinforcement learning

Info

Publication number: CN117041132B
Application number: CN202311292725.8A
Authority: CN
Inventors: 吕志成; 岳洋; 倪少杰; 牟卫华; 唐小妹; 李宗楠; 李蓬蓬; 马春江
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2023-10-08
Filing date: 2023-10-08
Publication date: 2023-12-08
Anticipated expiration: 2043-10-08
Also published as: CN117041132A

Abstract

This application relates to a distributed load balancing satellite routing method based on deep reinforcement learning. On the one hand, it obtains the queue occupancy rate and queue status identification of neighbor satellite nodes by sensing the data cache queue load of neighbor satellite nodes; on the other hand, the neighbor nodes are Combined with the geographical distance factor and topological distance factor of the destination node, it is used to measure the proximity between the two. The resulting topological-geographical distance factor can better reflect the proximity between satellite nodes in the same hemisphere or different hemispheres. As a good driving factor for satellite network routing, it accelerates the convergence of routing algorithms, so that communication data flows can not only propagate along the shortest propagation distance path under low load conditions during the routing and forwarding process, but also bypass high-load nodes in time, thus Reduce node congestion as much as possible, achieve load balancing of data traffic transmission, reduce network packet loss rate and improve throughput.

Description

A distributed load balancing satellite routing method based on deep reinforcement learning

技术领域Technical field

本申请涉及无线通信技术领域，特别是涉及一种基于深度强化学习的分布式负载均衡卫星路由方法。This application relates to the field of wireless communication technology, and in particular to a distributed load balancing satellite routing method based on deep reinforcement learning.

背景技术Background technique

近年来，以“铱星”（Iridium）、“全球星”（GlobalStar）、“一网”（OneWeb）和“星链”（Starlink）等为代表的低轨卫星通信系统成为新一代卫星通信系统的主流发展趋势，并带动了全球低轨卫星星座的建设热潮。低轨卫星星座在卫星数量、轨道高度、空间环境、信号强度、批量研制、快速部署等方面存在的显著优势，将有可能带来卫星通信、导航、遥感、侦查、监视、预警探测等技术的重大变革，并影响空间电子信息领域的整体格局。以低轨卫星星座为核心的卫星网络摆脱了传统地面网络需要建站的物理限制，能够实现全球连续覆盖，并提供具有低延时、高速率、高功率和高通量等特征的实时通信服务。In recent years, low-orbit satellite communication systems represented by "Iridium", "GlobalStar", "OneWeb" and "Starlink" have become a new generation of satellite communication systems The mainstream development trend has led to the construction boom of global low-orbit satellite constellations. The significant advantages of the low-orbit satellite constellation in the number of satellites, orbital altitude, space environment, signal strength, batch development, rapid deployment, etc. will likely bring about the advancement of satellite communications, navigation, remote sensing, reconnaissance, surveillance, early warning detection and other technologies. Major changes and impact on the overall pattern of space electronic information field. Satellite networks with low-orbit satellite constellations as the core get rid of the physical limitations of traditional terrestrial networks that require station construction, can achieve global continuous coverage, and provide real-time communication services with low latency, high speed, high power and high throughput.

低轨卫星通信系统通过在卫星之间建立星间链路（Inter-Satellite Link，ISL）形成以卫星为交换节点的空间通信网络，实现地面站与卫星、地面用户之间的点对点通信。使用星间链路组网的卫星网络能够减少地面信关站的部署数量，降低地面段系统复杂度，形成相对地面独立的数据中继系统，从而真正成为空天地一体化信息网络中相对独立且重要的组成部分。对于包含大量卫星节点和星间链路的低轨卫星通信网络，路由策略的选择成为保证数据流及时准确完成传输和递交的关键。由于位于地面不同地理位置的用户之间、用户与地面站、地面站与地面站之间的通信数据，经由基于星间链路组网的卫星网络在卫星节点之间进行中继传输，而路由策略决定了数据流在卫星网络中继传输的路径选择。与地面网络不同，卫星网络中各个节点虽然作为路由转发节点，但是处于持续不断的运动中，并且地球与卫星星座保持相对运动，地面网络服务请求也存在地理分布不均匀的情况，因此，卫星互联网的流量传输具有高度动态性，很容易出现网络流量分布不均衡的现象。由于低轨卫星网络的动态性、网络带宽和存储资源的不均衡、网络流量负载的不均衡等因素，对于同一对源节点-目的节点来说，不同传输路径的通信质量大不相同，因此如何为每个用户的通信请求规划一条高效可靠的路径成为卫星网络路由技术研究的关键问题。The low-orbit satellite communication system forms a space communication network with satellites as switching nodes by establishing inter-satellite links (ISL) between satellites to achieve point-to-point communication between ground stations, satellites, and ground users. The satellite network using inter-satellite links can reduce the number of ground gateway stations deployed, reduce the complexity of the ground segment system, and form a relatively independent data relay system on the ground, thus truly becoming a relatively independent and independent information network in the space, space and ground integrated information network. important part. For low-orbit satellite communication networks that contain a large number of satellite nodes and inter-satellite links, the selection of routing strategies has become the key to ensuring the timely and accurate transmission and delivery of data flows. Because the communication data between users located in different geographical locations on the ground, between users and ground stations, and between ground stations, is relayed and transmitted between satellite nodes through the satellite network based on inter-satellite link networking, and routing The policy determines the path selection for data flow relay transmission in the satellite network. Different from the terrestrial network, although each node in the satellite network serves as a routing and forwarding node, it is in continuous motion, and the earth and the satellite constellation maintain relative motion. The terrestrial network service requests are also unevenly distributed geographically. Therefore, the satellite Internet Traffic transmission is highly dynamic, and it is easy for network traffic distribution to be uneven. Due to the dynamics of low-orbit satellite networks, imbalance of network bandwidth and storage resources, imbalance of network traffic load and other factors, for the same source node-destination node pair, the communication quality of different transmission paths is very different, so how to Planning an efficient and reliable path for each user's communication request has become a key issue in satellite network routing technology research.

发明内容Contents of the invention

基于此，有必要针对上述技术问题，提供一种基于深度强化学习的分布式负载均衡卫星路由方法，旨在解决低轨卫星通信系统考虑网络流量负载存在不均衡情况下的高效可靠的星间链路网络路由设计技术问题，设计一种能够实时感知低轨通信卫星网络状态和流量负载变化情况，并优化调整信号传播路径，实现卫星网络数据流量的负载均衡和高效传输。Based on this, it is necessary to address the above technical issues and provide a distributed load balancing satellite routing method based on deep reinforcement learning, aiming to solve the problem of efficient and reliable inter-satellite links in low-orbit satellite communication systems considering the imbalance of network traffic load. The technical issue of road network routing design is to design a system that can sense the network status and traffic load changes of low-orbit communication satellites in real time, and optimize and adjust the signal propagation path to achieve load balancing and efficient transmission of satellite network data traffic.

一种分布式负载均衡卫星路由方法，所述方法包括：A distributed load balancing satellite routing method, the method includes:

根据当前时隙的低轨卫星网络拓扑快照得到当前的星间链路；在星间链路中包含多个卫星节点；每一卫星节点对应一个路由决策智能体；The current inter-satellite link is obtained based on the low-orbit satellite network topology snapshot of the current time slot; the inter-satellite link contains multiple satellite nodes; each satellite node corresponds to a routing decision-making agent;

在当前时隙下的星间链路中，根据当前卫星节点的邻居卫星节点和目标卫星节点的空间坐标计算得到邻居卫星节点和目标卫星节点的地理距离因子，根据当前卫星节点的邻居卫星节点和目标卫星节点的虚拟拓扑地址计算得到邻居卫星节点和目标卫星节点的拓扑距离因子，根据地理距离因子和对应的拓扑距离因子得到拓扑-地理距离因子；其中，虚拟拓扑地址包括卫星节点的所在轨道编号及其轨道内顺序编号；In the inter-satellite link under the current time slot, the geographical distance factor of the neighbor satellite node and the target satellite node is calculated based on the spatial coordinates of the current satellite node's neighbor satellite node and the target satellite node. According to the current satellite node's neighbor satellite node and The virtual topological address of the target satellite node is calculated to obtain the topological distance factor of the neighboring satellite node and the target satellite node, and the topological-geographical distance factor is obtained according to the geographical distance factor and the corresponding topological distance factor; among which, the virtual topological address includes the orbit number of the satellite node. and its sequence number within the track;

将每一邻居卫星节点的队列状态标识符、队列占用率及其与目标卫星节点的拓扑-地理距离因子作为当前时隙下的观测状态信息并输入当前卫星节点对应的路由决策智能体的深度神经网络中，输出当前卫星节点将数据流转发至各个邻居卫星节点的状态-动作价值；The queue status identifier, queue occupancy rate and topology-geographic distance factor of each neighbor satellite node and the target satellite node are used as the observation status information under the current time slot and input into the deep neural network of the routing decision agent corresponding to the current satellite node. In the network, output the status-action value of the current satellite node forwarding the data stream to each neighbor satellite node;

选取最大状态-动作价值对应的邻居卫星节点作为当前卫星节点的下一跳转发节点，完成当前时隙的数据转发任务。Select the neighbor satellite node corresponding to the maximum state-action value as the next-hop forwarding node of the current satellite node to complete the data forwarding task of the current time slot.

上述基于深度强化学习的分布式负载均衡卫星路由方法，一方面通过感知邻居卫星节点的数据缓存队列负载情况，得到邻居卫星节点的队列占用率和队列状态标识；一方面将邻居节点与目的节点的地理距离因子和拓扑距离因子结合起来用于衡量两者的接近程度，其中，拓扑距离因子考虑了2个卫星节点的轨道面接近程度和轨道内顺次接近程度，因此拓扑-地理距离因子能够更好地反映处于同半球或不同半球的卫星节点之间的接近程度，可以作为卫星网络路由良好的驱动因子加快路由算法收敛，从而使得通信数据流在路由转发过程中既能够在低负载情况下按照最短传播距离的路径进行传播，又能够及时绕开高负载的节点，从而尽可能减小节点拥塞情况，实现数据流量传输的负载均衡，减小网络丢包率并提高吞吐量。The above-mentioned distributed load balancing satellite routing method based on deep reinforcement learning, on the one hand, obtains the queue occupancy rate and queue status identification of neighbor satellite nodes by sensing the data cache queue load of neighbor satellite nodes; on the other hand, it combines the relationship between neighbor nodes and destination nodes. The geographical distance factor and the topological distance factor are combined to measure the proximity between the two. The topological distance factor takes into account the orbital surface proximity and intra-orbital sequential proximity of the two satellite nodes, so the topological-geographical distance factor can be more accurate. It can well reflect the proximity between satellite nodes in the same hemisphere or different hemispheres, and can be used as a good driving factor for satellite network routing to speed up the convergence of the routing algorithm, so that the communication data flow can be processed according to the requirements under low load conditions during the routing and forwarding process. Propagate through the path with the shortest propagation distance, and can avoid high-load nodes in time, thereby minimizing node congestion, achieving load balancing of data traffic transmission, reducing network packet loss rate and improving throughput.

附图说明Description of the drawings

图1为一种基于深度强化学习的分布式负载均衡卫星路由方法的流程示意图；Figure 1 is a flow chart of a distributed load balancing satellite routing method based on deep reinforcement learning;

图2为一种基于地理位置的“贪婪”路由策略示意图；Figure 2 is a schematic diagram of a "greedy" routing strategy based on geographical location;

图3为强化学习技术框架示意图；Figure 3 is a schematic diagram of the reinforcement learning technology framework;

图4为一种基于深度强化学习的分布式负载均衡卫星路由方法的深度神经网络结构；Figure 4 shows the deep neural network structure of a distributed load balancing satellite routing method based on deep reinforcement learning;

图5为一种基于深度强化学习的分布式负载均衡卫星路由方法的系统实现架构；Figure 5 shows the system implementation architecture of a distributed load balancing satellite routing method based on deep reinforcement learning;

图6为一个实施例中计算机设备的内部结构图。Figure 6 is an internal structure diagram of a computer device in one embodiment.

具体实施方式Detailed ways

为了使本申请的目的、技术方案及优点更加清楚明白，以下结合附图及实施例，对本申请进行进一步详细说明。应当理解，此处描述的具体实施例仅仅用以解释本申请，并不用于限定本申请。In order to make the purpose, technical solutions and advantages of the present application more clear, the present application will be further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not used to limit the present application.

在一个实施例中，如图1所示，提供了一种基于深度强化学习的分布式负载均衡卫星路由方法，包括以下步骤：In one embodiment, as shown in Figure 1, a distributed load balancing satellite routing method based on deep reinforcement learning is provided, including the following steps:

步骤102，根据当前时隙的低轨卫星网络拓扑快照得到当前的星间链路。Step 102: Obtain the current inter-satellite link based on the LEO satellite network topology snapshot of the current time slot.

其中，在星间链路中包含多个卫星节点，且每一卫星节点对应一个路由决策智能体。Among them, the inter-satellite link contains multiple satellite nodes, and each satellite node corresponds to a routing decision-making agent.

使用虚拟拓扑策略的拓扑控制方法将低轨卫星网络拓扑划分为多个时隙，每个时隙内卫星网络可视为静态拓扑，即卫星节点之间的连接关系不发生改变，每个时隙内的静态拓扑即为对应时隙下的星间链路。The topology control method using virtual topology strategy divides the low-orbit satellite network topology into multiple time slots. The satellite network in each time slot can be regarded as a static topology, that is, the connection relationship between satellite nodes does not change. The static topology in is the inter-satellite link under the corresponding time slot.

步骤104，在当前时隙下的星间链路中，根据当前卫星节点的邻居卫星节点和目标卫星节点的空间坐标计算得到邻居卫星节点和目标卫星节点的地理距离因子，根据当前卫星节点的邻居卫星节点和目标卫星节点的虚拟拓扑地址计算得到邻居卫星节点和目标卫星节点的拓扑距离因子，根据地理距离因子和对应的拓扑距离因子得到拓扑-地理距离因子。Step 104: In the inter-satellite link under the current time slot, calculate the geographical distance factor of the neighbor satellite node and the target satellite node according to the spatial coordinates of the neighbor satellite node and the target satellite node of the current satellite node. According to the neighbors of the current satellite node The virtual topological addresses of the satellite node and the target satellite node are calculated to obtain the topological distance factor of the neighbor satellite node and the target satellite node, and the topological-geographical distance factor is obtained according to the geographical distance factor and the corresponding topological distance factor.

其中，虚拟拓扑地址包括卫星节点的所在轨道编号及其轨道内顺序编号。Among them, the virtual topology address includes the orbit number of the satellite node and its sequence number within the orbit.

如图2所示，提供一种基于地理位置的“贪婪”路由策略示意图。在低轨卫星网络中，对于在源节点S产生且目的节点为D的数据流，若每跳转发均选择地理空间上距离目的节点更接近的卫星节点，则/>将经历最少的转发跳数。在绝大多数情况下最短传播距离路径包含于多条最小跳数路径中，且最短传播距离路径与最小跳数路径在传播时延上差别不大。因此，通过地理距离不断接近目的节点所产生的“贪婪”路由策略，能够快速地寻找到达目的节点D的最小跳数路径，从而实现最优的时延性能。As shown in Figure 2, a schematic diagram of a "greedy" routing strategy based on geographical location is provided. In the low-orbit satellite network, for the data flow generated at the source node S and the destination node is D , if each hop forwarding selects a satellite node that is geographically closer to the destination node, then/> Will experience the minimum number of forwarding hops. In most cases, the shortest propagation distance path is included in multiple minimum hop count paths, and there is not much difference in propagation delay between the shortest propagation distance path and the minimum hop count path. Therefore, through the "greedy" routing strategy generated by the continuous geographical distance to the destination node, the minimum hop count path to the destination node D can be quickly found, thereby achieving optimal delay performance.

由图2可知，当数据流位于节点/>且目的节点为/>时，选择节点/>作为下一跳节点在地理位置上将更加逼近目的节点/>。因此，在网络负载较轻的场景下，这种基于地理位置的“贪婪”选择策略将有利于快速找到转发跳数最小的传输路径，实现数据流/>经历的端到端时延最小化。除了用空间坐标计算卫星节点之间的接近程度外，由于低轨卫星网络拓扑的网格特性，卫星节点之间的传输距离还可以通过网格坐标进行衡量。卫星网络的网格化能够将球形的空间网络拓扑建模为二维平面的网格坐标，能够通过拓扑坐标衡量两个卫星节点之间的靠近程度。因此，本发明提出一种结合地理距离和拓扑距离的度量方法来表征两个节点之间的接近程度，用于驱动路由计算，即拓扑-地理距离因子（Topologicand Geographic Distance Factor，TGDF）。拓扑与地理距离因子TGDP由拓扑距离因子（Topologic Distance Factor，TDF）和地理距离因子（Geographic Distance Factor，GDF）两部分组成。在拓扑时隙/>内，卫星节点/>的三维地理坐标和虚拟拓扑地址分别为和/>，其中/>为轨道编号，/>为轨道内顺序编号。As can be seen from Figure 2, when the data flow Located at node/> And the destination node is/> When , select node/> As the next hop node, it will be geographically closer to the destination node/> . Therefore, in scenarios with light network load, this "greedy" selection strategy based on geographical location will help quickly find the transmission path with the smallest number of forwarding hops and realize data flow/> The end-to-end delay experienced is minimized. In addition to using spatial coordinates to calculate the proximity between satellite nodes, due to the grid characteristics of low-orbit satellite network topology, the transmission distance between satellite nodes can also be measured by grid coordinates. The gridding of satellite networks can model the spherical space network topology into two-dimensional plane grid coordinates, and can measure the proximity between two satellite nodes through topological coordinates. Therefore, the present invention proposes a measurement method that combines geographical distance and topological distance to characterize the proximity between two nodes and is used to drive routing calculations, namely Topologic and Geographic Distance Factor (TGDF). Topological and geographical distance factors TGDP consists of two parts: topological distance factor (Topologic Distance Factor, TDF) and geographical distance factor (Geographic Distance Factor, GDF). In topological slot/> Within, satellite node/> The three-dimensional geographical coordinates and virtual topological address of are respectively and/> , of which/> Number the track,/> Number the sequence within the track.

使用TGDF值，而不是单纯使用TDF或者GDF的优点在于：TGDF能够很好地反映处于同半球或不同半球的卫星节点之间的接近程度，因此可以使用TGDF作为卫星网络路由良好的驱动因子加快路由算法收敛。The advantage of using TGDF values instead of simply using TDF or GDF is that TGDF can well reflect the proximity between satellite nodes in the same hemisphere or different hemispheres. Therefore, TGDF can be used as a driving factor for good satellite network routing to speed up routing. The algorithm converges.

步骤106，将每一邻居卫星节点的队列状态标识符、队列占用率及其与目标卫星节点的拓扑-地理距离因子作为当前时隙下的观测状态信息并输入当前卫星节点对应的路由决策智能体的深度神经网络中，输出当前卫星节点将数据流转发至各个邻居卫星节点的状态-动作价值。Step 106: Use the queue status identifier, queue occupancy rate and topology-geographic distance factor of each neighbor satellite node and the target satellite node as the observation status information under the current time slot and input it into the routing decision agent corresponding to the current satellite node. In the deep neural network, the state-action value of the current satellite node forwarding the data stream to each neighbor satellite node is output.

在基于TCP/IP的传统卫星网络通信场景中，传输层协议中对于拥塞控制大多数是“消极、被动的”，即只有当节点出现了拥塞和丢包后才告知网络其他节点采取拥塞控制策略，减小对拥塞节点的数据发送速率，因而拥塞控制存在严重的滞后性，无法在节点即将发生拥塞和丢包时进行提前预知和路径规避。In the traditional satellite network communication scenario based on TCP/IP, most of the congestion control in the transport layer protocol is "passive and passive", that is, only when a node experiences congestion and packet loss, other nodes in the network are notified to adopt a congestion control strategy. , reducing the data sending rate to congested nodes, so congestion control has serious lag, and it is impossible to predict and avoid paths in advance when node congestion and packet loss are about to occur.

为了及时感知网络各节点的流量负载情况从而预测节点内部的队列时延，使数据流在进行每一跳转发时能够综合衡量传播时延与队列时延，本方法通过节点间的状态信息交换，使得卫星节点能够及时感知周围节点的队列负载情况和到达目的节点的距离。其中队列负载情况包括队列状态标识符和队列占用率，到达目的节点的距离即为拓扑-地理距离因子。In order to timely sense the traffic load of each node in the network and predict the queue delay within the node, the data flow Propagation delay and queue delay can be comprehensively measured when forwarding each hop. This method enables satellite nodes to timely sense the queue load of surrounding nodes and the distance to the destination node through the exchange of status information between nodes. The queue load condition includes the queue status identifier and queue occupancy rate, and the distance to the destination node is the topological-geographical distance factor.

步骤108，选取最大状态-动作价值对应的邻居卫星节点作为当前卫星节点的下一跳转发节点，完成当前时隙的数据转发任务。Step 108: Select the neighbor satellite node corresponding to the maximum state-action value as the next-hop forwarding node of the current satellite node to complete the data forwarding task of the current time slot.

上述基于深度强化学习的分布式负载均衡卫星路由方法中，一方面通过感知邻居卫星节点的数据缓存队列负载情况，得到邻居卫星节点的队列占用率和队列状态标识；一方面将邻居节点与目的节点的地理距离因子和拓扑距离因子结合起来用于衡量两者的接近程度，其中，拓扑距离因子考虑了2个卫星节点的轨道面接近程度和轨道内顺次接近程度，因此拓扑-地理距离因子能够更好地反映处于同半球或不同半球的卫星节点之间的接近程度，可以作为卫星网络路由良好的驱动因子加快路由算法收敛，从而使得通信数据流在路由转发过程中既能够在低负载情况下按照最短传播距离的路径进行传播，又能够及时绕开高负载的节点，从而尽可能减小节点拥塞情况，实现数据流量传输的负载均衡，减小网络丢包率并提高吞吐量。In the above-mentioned distributed load balancing satellite routing method based on deep reinforcement learning, on the one hand, the queue occupancy rate and queue status identification of the neighbor satellite nodes are obtained by sensing the data cache queue load of the neighbor satellite nodes; on the other hand, the neighbor nodes and the destination node are The geographical distance factor and the topological distance factor are combined to measure the proximity of the two. Among them, the topological distance factor takes into account the orbital surface proximity and the sequential proximity within the orbit of the two satellite nodes. Therefore, the topological-geographical distance factor can Better reflecting the proximity between satellite nodes in the same hemisphere or different hemispheres, it can be used as a driving factor for good satellite network routing to speed up the convergence of the routing algorithm, so that the communication data flow can be processed under low load conditions during the routing and forwarding process. Propagate according to the path with the shortest propagation distance, and can avoid high-load nodes in time, thereby minimizing node congestion, achieving load balancing of data traffic transmission, reducing network packet loss rate and improving throughput.

在一个实施例中，采用HELLO/ACK探测应答报文机制使得卫星节点之间通过星间链路进行状态信息交换；所述状态信息包括：卫星节点的IP地址、虚拟拓扑地址、空间坐标、队列占用率、时间戳。In one embodiment, the HELLO/ACK detection response message mechanism is used to exchange status information between satellite nodes through inter-satellite links; the status information includes: the IP address of the satellite node, virtual topology address, spatial coordinates, queue Occupancy rate, timestamp.

卫星节点通过星间链路接收邻居卫星节点/>传来的HELLO通知报文并解析其中包含节点/>的状态信息，同时卫星节点/>也向邻居节点/>发送包含自身状态信息的ACK应答报文。HELLO/ACK报文中包含了发送节点的IP地址、虚拟拓扑地址、空间坐标、队列占用率、时间戳等信息。satellite node Receive neighbor satellite nodes via inter-satellite links/> The incoming HELLO notification message is parsed and contains the node/> status information of satellite nodes/> Also to neighbor nodes/> Send an ACK response message containing its own status information. The HELLO/ACK message contains the IP address, virtual topology address, spatial coordinates, queue occupancy, timestamp and other information of the sending node.

在低轨卫星网络中，HELLO/ACK报文在卫星网络中的传输优先级高于业务数据包，保证各个卫星节点能够及时接收到来自周围节点的状态信息，从而确保分布式路由决策具有较好的实时性，能够快速应对高动态的卫星网络环境。为保证链路状态更新的实时性，卫星节点将以时间间隔周期性地向邻居节点发送HELLO通知报文，以保证邻居节点能够实时掌握队列和链路状态信息以辅助路由决策。当卫星节点向邻居节点发送HELLO通知报文时，将设置监测周期/>，若在监测周期/>内接收到邻居节点反馈的ACK应答报文，则表示与邻居节点能够建立通信，并且将过时的HELLO确认包丢弃；否则，若经过监测周期/>仍然没有收到邻居节点反馈的HELLO_ACK确认报文，则认为与邻居节点的通信链路中断，此时在本节点维护的该邻居节点运行状态表中将链路状态设置为“断开”，直到接收到该节点反馈的HELLO_ACK确认报文才重新建立通信连接。基于HELLO通知报文的邻居节点队列监测策略能够及时感知相邻节点的运行状态从而估计邻居节点的队列时延，为数据包的下一跳转发提供充足的路由决策信息。In the low-orbit satellite network, the transmission priority of HELLO/ACK messages in the satellite network is higher than that of business data packets, ensuring that each satellite node can receive status information from surrounding nodes in a timely manner, thereby ensuring that distributed routing decisions have better It has real-time performance and can quickly respond to highly dynamic satellite network environments. In order to ensure the real-time update of link status, the satellite node will update the link status in time intervals. Periodically send HELLO notification messages to neighbor nodes to ensure that neighbor nodes can grasp queue and link status information in real time to assist in routing decisions. When the satellite node sends a HELLO notification message to the neighbor node, the monitoring period/> , if during the monitoring period/> If the ACK response message fed back by the neighbor node is received within 10 seconds, it means that communication with the neighbor node can be established, and the outdated HELLO confirmation packet will be discarded; otherwise, if the monitoring cycle/> If the HELLO_ACK confirmation message from the neighbor node is still not received, it is considered that the communication link with the neighbor node is interrupted. At this time, the link status is set to "disconnected" in the neighbor node operating status table maintained by this node until The communication connection is re-established only after receiving the HELLO_ACK confirmation message fed back by the node. The neighbor node queue monitoring strategy based on HELLO notification messages can timely sense the operating status of neighboring nodes to estimate the queue delay of neighbor nodes, and provide sufficient routing decision information for the next hop forwarding of data packets.

在一个实施例中，根据当前卫星节点的邻居卫星节点和目标卫星节点的空间坐标计算得到邻居卫星节点和目标卫星节点的地理距离因子为：In one embodiment, the geographical distance factor of the neighbor satellite node and the target satellite node is calculated based on the spatial coordinates of the current satellite node's neighbor satellite node and the target satellite node:

； ;

其中，表示时隙/>下的第/>个卫星节点的邻居卫星节点，/>表示目标卫星节点，/>表示时隙/>下的第/>个卫星节点的邻居卫星节点的空间坐标，表示目标卫星节点的空间坐标，/>表示时隙/>下的第/>个卫星节点的邻居卫星节点和目标卫星节点的地理距离因子，/>表示光速。in, Indicates time slot/> The next page/> neighbor satellite nodes of satellite nodes,/> Indicates the target satellite node,/> Indicates time slot/> The next page/> The spatial coordinates of neighbor satellite nodes of satellite nodes, Indicates the spatial coordinates of the target satellite node,/> Indicates time slot/> The next page/> The geographical distance factor between the neighbor satellite node of a satellite node and the target satellite node,/> represents the speed of light.

由上式可知，GDF值与节点之间星间链路传播时延相等，因此直观反映了两节点之间的地理距离。It can be seen from the above equation that the GDF value is equal to the propagation delay of the inter-satellite link between nodes, so it intuitively reflects the geographical distance between the two nodes.

在一个实施例中，根据当前卫星节点的邻居卫星节点和目标卫星节点的虚拟拓扑地址计算得到邻居卫星节点和目标卫星节点的拓扑距离因子为：In one embodiment, the topological distance factor of the neighbor satellite node and the target satellite node is calculated based on the virtual topological addresses of the current satellite node's neighbor satellite node and the target satellite node:

； ;

其中，表示时隙/>下的第/>个卫星节点的邻居卫星节点的所在轨道编号，表示时隙/>下的第/>个卫星节点的邻居卫星节点的轨道内顺序编号，/>表示目标卫星节点的所在轨道编号，/>表示目标卫星节点的轨道内顺序编号，/>表示时隙/>下的第/>个卫星节点的邻居卫星节点和目标卫星节点的拓扑距离因子，/>表示轨道面接近程度对拓扑距离因子的影响因子，/>表示轨道内顺次接近程度对拓扑距离因子的影响因子，/>表示每个轨道内的卫星数。in, Indicates time slot/> The next page/> The orbit number of the neighbor satellite node of a satellite node, Indicates time slot/> Next page/> The orbital sequence number of the neighbor satellite node of a satellite node,/> Indicates the orbit number of the target satellite node,/> Represents the sequence number within the orbit of the target satellite node,/> Indicates time slot/> Next page/> The topological distance factor between the neighbor satellite node of a satellite node and the target satellite node,/> Indicates the influence factor of orbital surface proximity on topological distance factor,/> Represents the influence factor of the sequential proximity within the orbit on the topological distance factor,/> Indicates the number of satellites in each orbit.

由上式可知，越小则节点/>和/>所在轨道面越接近，TDF值越小，反映/>与/>越接近；/>表示节点/>和/>在南北方向上的最小跳数，当/>和/>的轨道内顺次越接近，TDF值越小，也表明/>和/>之间距离越小。It can be seen from the above formula that, The smaller the node/> and/> The closer the orbital surfaces are, the smaller the TDF value reflects/> with/> The closer;/> Represents node/> and/> The minimum number of hops in the north-south direction, when/> and/> The closer the order within the orbit, the smaller the TDF value, which also shows that/> and/> The smaller the distance between them.

在一个实施例中，根据地理距离因子和对应的拓扑距离因子得到拓扑-地理距离因子为：In one embodiment, the topological-geographical distance factor obtained based on the geographical distance factor and the corresponding topological distance factor is:

； ;

其中，表示时隙/>下的第/>个卫星节点的邻居卫星节点和目标卫星节点的拓扑-地理距离因子，/>表示拓扑距离因子对拓扑-地理距离因子的影响因子，/>表示地理距离因子对拓扑-地理距离因子的影响因子。in, Indicates time slot/> The next page/> The topology-geographic distance factor of the neighbor satellite node and the target satellite node of the satellite node,/> Represents the influence factor of topological distance factor on topological-geographical distance factor,/> Indicates the influence factor of geographical distance factor on topological-geographic distance factor.

在一个实施例中，将每一邻居卫星节点的队列状态标识符、队列占用率及其与目标卫星节点的拓扑-地理距离因子作为当前时隙下的观测状态信息并输入当前卫星节点对应的路由决策智能体的深度神经网络中。In one embodiment, the queue status identifier of each neighbor satellite node, the queue occupancy rate, and the topological-geographical distance factor from the target satellite node are used as the observation status information under the current time slot and the route corresponding to the current satellite node is input. In deep neural networks for decision-making agents.

获取预先构建的奖励函数；奖励函数的表达式为：Get a pre-built reward function; the expression for the reward function is:

； ;

其中，表示时隙/>下的第/>个卫星节点的奖励函数，/>表示时隙/>下的第/>个卫星节点的邻居卫星节点的队列状态标识，/>表示时隙/>下的第/>个卫星节点的邻居卫星节点的队列占用率。in, Indicates time slot/> The next page/> Reward function of satellite nodes,/> Indicates time slot/> The next page/> The queue status identifier of the neighbor satellite node of the satellite node,/> Indicates time slot/> The next page/> Queue occupancy rate of neighbor satellite nodes of a satellite node.

奖励函数作为强化学习中驱动智能体修正动作策略的“指挥棒”，对于路由决策智能体，选择某个相邻建链节点作为下一跳转发节点这一动作所获得的奖励直接反映该下一跳节点到达目的节点的潜在端到端时延性能。将所选下一跳节点与目的节点之间的TGDF以及下一跳节点的队列占用率QOR纳入奖励函数中用于综合评估节点的端到端时延性能。式中/>为路由决策智能体所选择的下一跳节点，由奖励函数表达式可知，当下一跳节点就是目的节点时，智能体获得很大的正奖励，表明传输任务已完成；当下一跳节点不是目的节点时，智能体将获得负奖励，路由传输过程中的负奖励能够激励智能体的路由决策尽可能快地逼近目的节点；当所选择下一跳节点的队列状态标识/>时，表明该节点处于“空闲”状态，奖励函数仅与下一跳节点与目的节点之间TGDF有关，TGDF越小表明距离目的节点越近，因此获得的负奖励越小；当下一跳节点的队列状态标识/>时，表明该节点处于“中等负载”状态，此时奖励函数开始考虑节点的队列占用率QOR；当下一跳节点的队列状态标识/>时，表明该节点处于“繁忙”状态，此时下一跳节点队列占用率对奖励函数的影响权重将进一步提高，表明此时下一跳节点的队列时延在端到端时延性能中起主导作用。The reward function serves as the "baton" that drives the agent's corrective action strategy in reinforcement learning. For the routing decision-making agent, the reward obtained by selecting an adjacent link-building node as the next-hop forwarding node directly reflects the next step. The potential end-to-end delay performance of a hop node to the destination node. The TGDF between the selected next-hop node and the destination node and the queue occupancy rate QOR of the next-hop node are included in the reward function to comprehensively evaluate the end-to-end delay performance of the node. . Formula in/> The next hop node selected by the routing decision-making agent can be seen from the reward function expression. When the next hop node is the destination node, the agent gets a large positive reward, indicating that the transmission task has been completed; when the next hop node is not the destination node When the node is reached, the agent will receive a negative reward. The negative reward during the route transmission process can encourage the agent's routing decision to approach the destination node as quickly as possible; when the queue status indicator of the selected next hop node /> When , it indicates that the node is in the "idle" state, and the reward function is only related to the TGDF between the next hop node and the destination node. The smaller the TGDF, the closer it is to the destination node, so the smaller the negative reward obtained; when the next hop node Queue status identifier/> When , it indicates that the node is in the "medium load" state. At this time, the reward function begins to consider the queue occupancy rate QOR of the node; the queue status identifier of the current next-hop node/> When , it indicates that the node is in a "busy" state. At this time, the impact weight of the next-hop node queue occupancy rate on the reward function will be further increased, indicating that the queue delay of the next-hop node plays a leading role in the end-to-end delay performance at this time. .

根据观测状态信息和奖励函数得到当前卫星节点将数据流转发至各个邻居卫星节点的状态-动作价值。According to the observed state information and reward function, the state-action value of the current satellite node forwarding the data flow to each neighbor satellite node is obtained.

在一个实施例中，构建所述路由决策智能体的深度神经网络的步骤包括：In one embodiment, the step of constructing the deep neural network of the routing decision agent includes:

构建第一深度神经网络和第二深度神经网络；第一深度神经网络和第二深度神经网络的网络结构相同；其中，第一深度神经网络作为主网络，用于实时估计状态-动作价值；第二深度神经网络作为目标网络，用于计算目标状态-动作价值并提供给第一深度神经网络用于更新网络参数。Constructing a first deep neural network and a second deep neural network; the network structures of the first deep neural network and the second deep neural network are the same; among them, the first deep neural network is used as the main network for real-time estimation of state-action value; The second deep neural network serves as the target network and is used to calculate the target state-action value and provide it to the first deep neural network for updating network parameters.

状态-动作价值是指基于贝尔曼方程下的最优的状态-动作价值目标函数Q值，计算公式为：State-action value refers to the optimal state-action value objective function Q value based on the Bellman equation. The calculation formula is:

； ;

其中是目标Q值，/>为奖励函数，/>为估计/>网络，/>为下一个观测状态，为目标/>网络,/>为估计/>网络的参数，/>为目标/>网络的参数。in is the target Q value,/> is the reward function,/> is an estimate/> Network,/> is the next observation state, for target/> Network,/> is an estimate/> Network parameters,/> for target/> Network parameters.

其中一个DNN为目标Q值网络，另一个DNN为估计Q值网络，每训练一段时间会将估计Q值网络的参数全部赋值给目标Q值网络，用于决定下一跳转发节点的网络是估计Q网络，目标Q网络用于为估计Q网络参数的更新更加合理，防止估计Q网络在训练过程中对Q值的过高估计。One DNN is the target Q-value network, and the other DNN is the estimated Q-value network. Every training period, all parameters of the estimated Q-value network will be assigned to the target Q-value network. The network used to determine the next hop forwarding node is Estimated Q network, the target Q network is used to make the update of the estimated Q network parameters more reasonable and prevent the estimated Q network from overestimating the Q value during the training process.

本方法采用经典的深度强化学习Double-DQN（Double Deep Q-Network）算法实现基于TGDF驱动的卫星网络路由算法。Double-DQN算法将深度神经网络（Deep NeuralNetworks，DNN）与强化学习的Q-Learning算法进行结合，用DNN强大的非线性拟合能力对状态-动作价值函数Q值进行拟合，从而解决Q-Learning算法中随着状态空间增大记录了状态-动作价值函数的Q值表格所需存储空间急剧增大从而算法收敛速率显著降低的问题。This method uses the classic deep reinforcement learning Double-DQN (Double Deep Q-Network) algorithm to implement a satellite network routing algorithm based on TGDF driver. The Double-DQN algorithm combines Deep Neural Networks (DNN) with the Q-Learning algorithm of reinforcement learning, and uses the powerful nonlinear fitting ability of DNN to fit the Q value of the state-action value function, thereby solving the problem of Q- In the Learning algorithm, as the state space increases, the storage space required for the Q-value table that records the state-action value function increases sharply, resulting in a significant reduction in the convergence rate of the algorithm.

强化学习是机器学习的分支之一，主要用于解决连续决策的问题，强化学习可以在复杂的、不确定的环境中学习如何实现设定的目标。作为一种“奖惩”式的学习算法，强化学习不依赖于系统的先验知识，也不需要建立精确的数学模型，避免了模型设计不准确的问题，只是利用环境给出的奖惩信号来训练算法的参数。强化学习的基本原理是：如果决策主体执行某个动作后收到环境的正奖励信号（强化信号），则决策主体执行当前动作的趋势会加强；反之，决策主体执行这个动作的趋势会减弱，强化学习技术框架如图3所示。强化学习的框架主要由智能体（Agent）、环境（Enviroment）、状态（State）、动作（Action）以及奖励（Reward）五部分组成。智能体是强化学习的核心，具备感知和学习的能力；状态是智能体在当前环境下所处的参数形态；动作是受智能体支配的执行操作；环境是智能体所探索的对象；奖励是根据环境提供的信息对智能体在当前状态下决策行为的评判。智能体执行了某个动作后，环境将会转换到一个新的状态，对于该新的状态环境会给出奖励信号（正奖励或负奖励），按照一定学习率更新智能体的策略并执行新的动作，重复此过程，直到智能体学习到能够得到最大奖励收益的决策策略。上述过程为智能体和环境通过状态、动作、奖励进行交互的方式。Reinforcement learning is one of the branches of machine learning and is mainly used to solve continuous decision-making problems. Reinforcement learning can learn how to achieve set goals in complex and uncertain environments. As a "reward and punishment" learning algorithm, reinforcement learning does not rely on prior knowledge of the system, nor does it need to establish a precise mathematical model, avoiding the problem of inaccurate model design. It only uses reward and punishment signals given by the environment for training. Algorithm parameters. The basic principle of reinforcement learning is: if the decision-making subject receives a positive reward signal (reinforcement signal) from the environment after performing an action, the tendency of the decision-making subject to perform the current action will be strengthened; conversely, the tendency of the decision-making subject to perform this action will be weakened. The reinforcement learning technology framework is shown in Figure 3. The framework of reinforcement learning mainly consists of five parts: agent, environment, state, action and reward. The agent is the core of reinforcement learning and has the ability to perceive and learn; the state is the parameter form of the agent in the current environment; the action is the execution operation controlled by the agent; the environment is the object explored by the agent; the reward is Evaluation of the agent's decision-making behavior in the current state based on the information provided by the environment. After the agent performs an action, the environment will transition to a new state. For this new state, the environment will give a reward signal (positive or negative reward), update the agent's strategy according to a certain learning rate, and execute the new state. action, and repeat this process until the agent learns the decision-making strategy that can obtain the maximum reward benefit. The above process is the way in which the agent and the environment interact through states, actions, and rewards.

首先，确定TGDF-DRL方法中强化学习模型框架的五个要素：First, identify the five elements of the reinforcement learning model framework in the TGDF-DRL method:

(1) 智能体。采用分布式思想，将路由决策智能体部署在每颗卫星节点内部，卫星节点根据HELLO通知报文机制不断地获取环境状态信息，当卫星节点需要传输数据流时，节点内部Double-DQN网络将状态信息作为输入然后输出该状态下采取各个动作的Q值，节点根据贪婪法选择的动作将数据流传输到下一跳。(1) Intelligent agent. Adopting distributed thinking, the routing decision-making agent is deployed inside each satellite node. The satellite node continuously obtains environmental status information according to the HELLO notification message mechanism. When the satellite node needs to transmit data flow, the Double-DQN network inside the node transfers the status The information is taken as input and then the Q value of each action taken in this state is output. The node transmits the data stream to the next hop according to the action selected by the greedy method.

(2) 路由环境。路由环境即为整个低轨卫星网络拓扑状态几何，当数据流位于卫星节点/>时，数据流/>进行下一跳路由选择所能够感知到的环境信息仅仅来自/>的邻居节点几何/>。(2) Routing environment. The routing environment is the topological state geometry of the entire low-orbit satellite network. When the data flow Located at satellite node/> When, data flow/> The environmental information that can be perceived for next-hop routing selection only comes from/> Neighbor node geometry /> .

(3) 状态空间。在时刻，数据中继节点/>将从路由环境中观测到来自邻居节点的状态信息。由于路由算法的总体目标是尽可能以最小的转发跳数传输数据从而尽可能减小传播时延和排队时延，因此将邻居节点集合/>到目的节点/>的TGDF纳入状态信息，以衡量/>邻居节点到达目的节点的远近，若邻居节点与目的节点的TGDF越小，表明该节点位于/>到目的节点/>之间的最小端到端时延路径的概率越大。然而TGDF只能衡量数据流的潜在传播时延，为了度量邻居节点的队列时延从而更加全面地评估下一跳节点的端到端时延性能，将邻居节点集合/>的队列占用率/>以及队列状态标识/>纳入状态信息。(3) State space. exist time, data relay node/> Status information from neighbor nodes will be observed from the routing environment. Since the overall goal of the routing algorithm is to transmit data with the smallest number of forwarding hops to minimize the propagation delay and queuing delay, neighbor nodes are gathered/> To destination node/> TGDF incorporates status information to measure/> The distance from the neighbor node to the destination node. If the TGDF between the neighbor node and the destination node is smaller, it means that the node is located in/> To destination node/> The greater the probability of the path with minimum end-to-end delay. However, TGDF can only measure the potential propagation delay of the data flow. In order to measure the queue delay of neighbor nodes and more comprehensively evaluate the end-to-end delay performance of the next hop node, neighbor nodes are aggregated/> Queue occupancy/> and queue status identifier/> Include status information.

因此，对于时刻卫星节点集合/>以及目的节点集合/>，状态空间定义为：Therefore, for Time satellite node collection/> And the destination node set/> , the state space is defined as:

； ;

由于路由决策智能体部署在每个低轨卫星上，对于单个卫星节点所能观测到的状态信息为：Since the routing decision agent is deployed on each low-orbit satellite, for a single satellite node The status information that can be observed is:

； ;

卫星节点观测到的状态/>为简洁直观的有限维向量，并且由于状态信息与当前节点/>、邻居节点/>以及目的节点/>自身信息无关，仅仅是用于描述“位于当前节点的数据包到达目的节点”这一任务目标的潜在端到端时延性能，因此网络参数收敛与“当前节点-目的节点”的具体组合无关，路由策略对于全网节点具有良好的收敛性能和泛化性。satellite node Observed state/> is a concise and intuitive limited-dimensional vector, and because the state information is related to the current node/> , neighbor node/> and destination node/> It has nothing to do with its own information. It is only used to describe the potential end-to-end delay performance of the task goal of "the data packet located at the current node reaches the destination node". Therefore, network parameter convergence has nothing to do with the specific combination of "current node-destination node". The routing strategy has good convergence performance and generalization for nodes in the entire network.

(4) 动作空间。在时刻，卫星节点/>需要将数据流/>发送到下一节点，因此动作空间设置为节点/>的相邻建链节点集合/>，即动作空间/>。(4) Action space. exist Time, satellite node/> Data flow/> needs to be is sent to the next node, so the action space is set to node/> The set of adjacent link-building nodes/> , that is, action space/> .

(5) 奖励函数。(5) Reward function.

对卫星路由场景以及强化学习要素建模后，卫星节点内部需要存储深度神经网络DNN学习观测到的环境信息，根据奖励函数规则建立环境信息与下一跳转发决策的相关性。基于Double-DQN的卫星节点路由决策智能体需要存储两个网络结构相同的DNN，其中一个DNN作为主网络MainNet用于实时估计Q值，另一个DNN作为目标网络TargetNet用于计算目标Q值并且给主网络MainNet提供参数更新的目标。对于主网络和目标网络，网络输入均为观测状态，包含相邻节点的TGDF、队列占用率QOR以及队列状态标识QSF，输出为当前状态下各个动作的状态-动作价值/>，本方法的深度神经网络结构如图4所示。After modeling the satellite routing scenario and reinforcement learning elements, the satellite node needs to store the environmental information observed by the deep neural network DNN learning, and establish the correlation between the environmental information and the next-hop forwarding decision based on the reward function rules. The satellite node routing decision-making agent based on Double-DQN needs to store two DNNs with the same network structure. One DNN is used as the main network MainNet for real-time estimation of the Q value, and the other DNN is used as the target network TargetNet for calculating the target Q value and giving MainNet provides the target for parameter updates. For both the main network and the target network, the network inputs are observation states , including the TGDF of adjacent nodes, queue occupancy QOR and queue status identifier QSF, and the output is the current status The status of each action - action value/> , The deep neural network structure of this method is shown in Figure 4.

路由觉得智能体完成训练后，即可通过输入的观测状态输出下一跳转发节点，并将数据包进行路由转发。在时刻，卫星节点/>首先根据解析的HELLO报文获得邻居节点/>的地理坐标、拓扑坐标、队列占用率和队列状态标识，对于要发送的数据流/>，解析对应目的节点/>的地理坐标和拓扑坐标，然后节点/>内部处理器计算出/>，并且将、/>和/>作为路由决策智能体的输入，路由决策智能体输出当前观测下选择下一跳节点对应的状态-动作价值函数Q值，节点/>根据/>贪婪法选择将数据流/>发送到下一跳转发节点，从而完成了当前时刻的数据转发任务。After the routing agent completes training, it can output the next-hop forwarding node through the input observation status, and route and forward the data packet. exist Time, satellite node/> First, obtain neighbor nodes based on the parsed HELLO message/> geographical coordinates, topological coordinates, queue occupancy and queue status identification, for the data flow to be sent/> , parse the corresponding destination node/> geographical coordinates and topological coordinates, then node/> The internal processor calculates/> , and will ,/> and/> As the input of the routing decision-making agent, the routing decision-making agent outputs the state-action value function Q value corresponding to the selected next-hop node under the current observation, node/> According to/> The greedy method selects the data stream/> Sent to the next hop forwarding node, thus completing the data forwarding task at the current moment.

如图5所示，提供本方法的系统实现架构。其中TTL是 Time To Live的缩写，该字段指定IP包被路由器丢弃之前允许通过的最大网段数量。本方法由深度强化学习智能体作为路由决策核心，路由决策智能体通过感知邻居节点的数据缓存队列负载情况，结合邻居节点与目的节点之间的靠近程度给出下一跳转发节点，从而使得通信数据流在路由转发过程中既能够在低负载情况下按照最短传播距离的路径进行传播，又能够及时绕开高负载的节点，从而尽可能减小节点拥塞情况，实现数据流量传输的负载均衡，减小网络丢包率并提高吞吐量。As shown in Figure 5, the system implementation architecture of this method is provided. TTL is the abbreviation of Time To Live. This field specifies the maximum number of network segments that an IP packet is allowed to pass before being discarded by the router. This method uses a deep reinforcement learning agent as the core of routing decision-making. The routing decision-making agent senses the data cache queue load of neighbor nodes and gives the next-hop forwarding node based on the proximity between the neighbor node and the destination node, so that During the routing and forwarding process, communication data flows can not only propagate along the shortest propagation distance path under low load conditions, but also bypass high-load nodes in time, thereby minimizing node congestion and achieving load balancing of data traffic transmission. , reduce network packet loss rate and improve throughput.

应该理解的是，虽然图1的流程图中的各个步骤按照箭头的指示依次显示，但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明，这些步骤的执行并没有严格的顺序限制，这些步骤可以以其它的顺序执行。而且，图1中的至少一部分步骤可以包括多个子步骤或者多个阶段，这些子步骤或者阶段并不必然是在同一时刻执行完成，而是可以在不同的时刻执行，这些子步骤或者阶段的执行顺序也不必然是依次进行，而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that although various steps in the flowchart of FIG. 1 are shown in sequence as indicated by arrows, these steps are not necessarily executed in the order indicated by arrows. Unless explicitly stated in this article, there is no strict order restriction on the execution of these steps, and these steps can be executed in other orders. Moreover, at least some of the steps in Figure 1 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but may be executed at different times. The execution of these sub-steps or stages The sequence is not necessarily sequential, but may be performed in turn or alternately with other steps or sub-steps of other steps or at least part of the stages.

在一个实施例中，提供了一种基于深度强化学习的分布式负载均衡卫星路由装置，包括：拓扑快照获取模块、距离因子计算模块、状态-动作价值计算模块和数据转发模块，其中：In one embodiment, a distributed load balancing satellite routing device based on deep reinforcement learning is provided, including: a topology snapshot acquisition module, a distance factor calculation module, a state-action value calculation module and a data forwarding module, wherein:

拓扑快照获取模块，用于根据当前时隙的低轨卫星网络拓扑快照得到当前的星间链路；在星间链路中包含多个卫星节点；每一卫星节点对应一个路由决策智能体；The topology snapshot acquisition module is used to obtain the current inter-satellite link based on the low-orbit satellite network topology snapshot of the current time slot; the inter-satellite link contains multiple satellite nodes; each satellite node corresponds to a routing decision-making agent;

距离因子计算模块，用于在当前时隙下的星间链路中，根据当前卫星节点的邻居卫星节点和目标卫星节点的空间坐标计算得到邻居卫星节点和目标卫星节点的地理距离因子，根据当前卫星节点的邻居卫星节点和目标卫星节点的虚拟拓扑地址计算得到邻居卫星节点和目标卫星节点的拓扑距离因子，根据地理距离因子和对应的拓扑距离因子得到拓扑-地理距离因子；其中，虚拟拓扑地址包括卫星节点的所在轨道编号及其轨道内顺序编号；The distance factor calculation module is used to calculate the geographical distance factor of the neighbor satellite node and the target satellite node according to the spatial coordinates of the neighbor satellite node and the target satellite node of the current satellite node in the inter-satellite link under the current time slot. The virtual topological address of the satellite node's neighbor satellite node and the target satellite node is calculated to obtain the topological distance factor of the neighbor satellite node and the target satellite node. The topological-geographical distance factor is obtained according to the geographical distance factor and the corresponding topological distance factor; where, the virtual topological address Including the orbit number of the satellite node and its sequence number within the orbit;

状态-动作价值计算模块，用于将每一邻居卫星节点的队列状态标识符、队列占用率及其与目标卫星节点的拓扑-地理距离因子作为当前时隙下的观测状态信息并输入当前卫星节点对应的路由决策智能体的深度神经网络中，输出当前卫星节点将数据流转发至各个邻居卫星节点的状态-动作价值；The state-action value calculation module is used to use the queue status identifier of each neighbor satellite node, the queue occupancy rate, and the topology-geographic distance factor with the target satellite node as the observation state information under the current time slot and input it to the current satellite node In the corresponding deep neural network of the routing decision-making agent, the status-action value of the current satellite node forwarding the data stream to each neighbor satellite node is output;

数据转发模块，用于选取最大状态-动作价值对应的邻居卫星节点作为当前卫星节点的下一跳转发节点，完成当前时隙的数据转发任务。The data forwarding module is used to select the neighbor satellite node corresponding to the maximum state-action value as the next-hop forwarding node of the current satellite node to complete the data forwarding task of the current time slot.

关于基于深度强化学习的分布式负载均衡卫星路由装置的具体限定可以参见上文中对于基于深度强化学习的分布式负载均衡卫星路由方法的限定，在此不再赘述。上述分布式负载均衡卫星路由装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中，也可以以软件形式存储于计算机设备中的存储器中，以便于处理器调用执行以上各个模块对应的操作。For specific limitations on the distributed load balancing satellite routing device based on deep reinforcement learning, please refer to the limitations on the distributed load balancing satellite routing method based on deep reinforcement learning mentioned above, which will not be described again here. Each module in the above-mentioned distributed load balancing satellite routing device can be realized in whole or in part by software, hardware and combinations thereof. Each of the above modules may be embedded in or independent of the processor of the computer device in the form of hardware, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.

在一个实施例中，提供了一种计算机设备，该计算机设备可以是服务器，其内部结构图可以如图6所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中，该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机程序和数据库。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的数据库用于存储卫星节点位置和队列负载情况等数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现一种分布式负载均衡卫星路由方法。In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 6 . The computer device includes a processor, memory, network interface, and database connected through a system bus. Wherein, the processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes non-volatile storage media and internal memory. The non-volatile storage medium stores operating systems, computer programs and databases. This internal memory provides an environment for the execution of operating systems and computer programs in non-volatile storage media. The computer device's database is used to store data such as satellite node locations and queue load conditions. The network interface of the computer device is used to communicate with external terminals through a network connection. The computer program when executed by the processor implements a distributed load balancing satellite routing method.

本领域技术人员可以理解，图6中示出的结构，仅仅是与本申请方案相关的部分结构的框图，并不构成对本申请方案所应用于其上的计算机设备的限定，具体的计算机设备可以包括比图中所示更多或更少的部件，或者组合某些部件，或者具有不同的部件布置。Those skilled in the art can understand that the structure shown in Figure 6 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied. Specific computer equipment can May include more or fewer parts than shown, or combine certain parts, or have a different arrangement of parts.

在一个实施例中，提供了一种计算机设备，包括存储器和处理器，该存储器存储有计算机程序，该处理器执行计算机程序时实现上述实施例中方法的步骤。In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program. When the processor executes the computer program, the steps of the method in the above embodiment are implemented.

在一个实施例中，提供了一种计算机可读存储介质，其上存储有计算机程序，计算机程序被处理器执行时实现上述实施例中方法的步骤。In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When the computer program is executed by a processor, the steps of the method in the above embodiment are implemented.

本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过计算机程序来指令相关的硬件来完成，所述的计算机程序可存储于一非易失性计算机可读取存储介质中，该计算机程序在执行时，可包括如上述各方法的实施例的流程。其中，本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用，均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器（ROM）、可编程ROM（PROM）、电可编程ROM（EPROM）、电可擦除可编程ROM（EEPROM）或闪存。易失性存储器可包括随机存取存储器（RAM）或者外部高速缓冲存储器。作为说明而非局限，RAM以多种形式可得，诸如静态RAM（SRAM）、动态RAM（DRAM）、同步DRAM（SDRAM）、双数据率SDRAM（DDRSDRAM）、增强型SDRAM（ESDRAM）、同步链路（Synchlink） DRAM（SLDRAM）、存储器总线（Rambus）直接RAM（RDRAM）、直接存储器总线动态RAM（DRDRAM）、以及存储器总线动态RAM（RDRAM）等。Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing relevant hardware through a computer program. The computer program can be stored in a non-volatile computer-readable storage. In the media, when executed, the computer program may include the processes of the above method embodiments. Any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain Synchlink DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

以上实施例的各技术特征可以进行任意的组合，为使描述简洁，未对上述实施例中的各个技术特征所有可能的组合都进行描述，然而，只要这些技术特征的组合不存在矛盾，都应当认为是本说明书记载的范围。The technical features of the above embodiments can be combined in any way. To simplify the description, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, all possible combinations should be used. It is considered to be within the scope of this manual.

以上所述实施例仅表达了本申请的几种实施方式，其描述较为具体和详细，但并不能因此而理解为对发明专利范围的限制。应当指出的是，对于本领域的普通技术人员来说，在不脱离本申请构思的前提下，还可以做出若干变形和改进，这些都属于本申请的保护范围。因此，本申请专利的保护范围应以所附权利要求为准。The above-described embodiments only express several implementation modes of the present application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that, for those of ordinary skill in the art, several modifications and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims

1. A distributed load balancing satellite routing method based on deep reinforcement learning, characterized in that the method includes:

The current inter-satellite link is obtained based on the low-orbit satellite network topology snapshot of the current time slot; the inter-satellite link contains multiple satellite nodes; each satellite node corresponds to a routing decision-making agent;

In the inter-satellite link under the current time slot, the geographical distance factor of the neighbor satellite node and the target satellite node is calculated based on the spatial coordinates of the current satellite node's neighbor satellite node and the target satellite node. According to the current satellite node's neighbor satellite node and The virtual topological address of the target satellite node is calculated to obtain the topological distance factor of the neighboring satellite node and the target satellite node, and the topological-geographical distance factor is obtained according to the geographical distance factor and the corresponding topological distance factor; among which, the virtual topological address includes the orbit number of the satellite node. and its sequence number within the track;

The queue status identifier, queue occupancy rate and topology-geographic distance factor of each neighbor satellite node and the target satellite node are used as the observation status information under the current time slot and input into the deep neural network of the routing decision agent corresponding to the current satellite node. In the network, output the status-action value of the current satellite node forwarding the data stream to each neighbor satellite node;

Select the neighbor satellite node corresponding to the maximum state-action value as the next-hop forwarding node of the current satellite node to complete the data forwarding task of the current time slot.

2. The method according to claim 1, characterized in that the geographical distance factor of the neighbor satellite node and the target satellite node is calculated according to the spatial coordinates of the neighbor satellite node and the target satellite node of the current satellite node, including:

According to the spatial coordinates of the neighbor satellite node and the target satellite node of the current satellite node, the geographical distance factor of the neighbor satellite node and the target satellite node is calculated as:

;

in, Indicates time slot/> The next page/> neighbor satellite nodes of satellite nodes,/> Represents the target satellite node, Indicates time slot/> The next page/> The spatial coordinates of neighbor satellite nodes of satellite nodes, Indicates the spatial coordinates of the target satellite node,/> Indicates time slot/> The next page/> The geographical distance factor between the neighbor satellite node of a satellite node and the target satellite node,/> represents the speed of light.

3. The method according to claim 2, characterized in that the topological distance factor of the neighbor satellite node and the target satellite node is calculated according to the virtual topological addresses of the neighbor satellite node and the target satellite node of the current satellite node, including:

According to the virtual topological addresses of the neighbor satellite nodes and the target satellite node of the current satellite node, the topological distance factor of the neighbor satellite node and the target satellite node is calculated as:

;

in, Indicates time slot/> The next page/> The orbit number of the satellite node's neighbor satellite node,/> Indicates time slot/> The next page/> The orbital sequence number of the neighbor satellite node of a satellite node,/> Indicates the orbit number of the target satellite node,/> Represents the sequence number within the orbit of the target satellite node,/> Indicates time slot/> The next page/> The topological distance factor between the neighbor satellite node of a satellite node and the target satellite node,/> Indicates the influence factor of orbital surface proximity on topological distance factor,/> Represents the influence factor of the sequential proximity within the orbit on the topological distance factor,/> Indicates the number of satellites in each orbit.

4. The method according to claim 3, characterized in that, obtaining the topological-geographical distance factor according to the geographical distance factor and the corresponding topological distance factor includes:

According to the geographical distance factor and the corresponding topological distance factor, the topological-geographical distance factor is:

;

in, Indicates time slot/> The next page/> The topology-geographic distance factor of the neighbor satellite node and the target satellite node of the satellite node,/> Represents the influence factor of topological distance factor on topological-geographical distance factor,/> Indicates the influence factor of geographical distance factor on topological-geographic distance factor.

5. The method according to claim 4, wherein the step of constructing the deep neural network of the routing decision agent includes:

Construct a first deep neural network and a second deep neural network; the first deep neural network and the second deep neural network have the same network structure; wherein the first deep neural network serves as the main network and is used to estimate the state in real time - Action value; the second deep neural network serves as the target network and is used to calculate the target state-action value and provide it to the first deep neural network for updating network parameters.

6. The method according to claim 5, characterized in that the queue status identifier, queue occupancy rate and topology-geographic distance factor of each neighbor satellite node and the target satellite node are used as the observation state under the current time slot. The information is input into the deep neural network of the routing decision agent corresponding to the current satellite node, and the status-action value of the current satellite node forwarding the data stream to each neighbor satellite node is output, including:

The queue status identifier, queue occupancy rate and topology-geographic distance factor of each neighbor satellite node and the target satellite node are used as the observation status information under the current time slot and input into the deep neural network of the routing decision agent corresponding to the current satellite node. in the network;

Get a pre-built reward function; the expression of the reward function is:

;

in, Indicates time slot/> The next page/> Reward function of satellite nodes,/> Indicates time slot/> The next page/> The queue status identifier of the neighbor satellite node of the satellite node,/> Indicates time slot/> The next page/> Queue occupancy rate of neighbor satellite nodes of satellite nodes;

According to the observation status information and the reward function, the status-action value of the current satellite node forwarding the data flow to each neighbor satellite node is obtained.

7. The method according to claim 1, characterized in that, the method further comprises:

The HELLO/ACK detection response message mechanism is used to exchange status information between satellite nodes through inter-satellite links; the status information includes: the IP address of the satellite node, virtual topology address, spatial coordinates, queue occupancy rate, and timestamp.