WO2019192361A1

WO2019192361A1 - Congestion control in network communications

Info

Publication number: WO2019192361A1
Application number: PCT/CN2019/079786
Authority: WO
Inventors: Hui ZANG; Yiming Kong
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2018-04-06
Filing date: 2019-03-27
Publication date: 2019-10-10
Also published as: CN111919423A; CN111919423B

Abstract

In certain embodiments, a method for Transmission Control Protocol (TCP) congestion control includes determining, by a communication device and based on a first signal received from a communication network, a first network state. The method further includes determining, by a loss predictor of the communication device and according to the first network state, a first loss prediction for a first packet. The first loss prediction indicates a likelihood that the first packet will be lost due to network congestion if the first packet is transmitted via the communication network. The method further includes determining, by the communication device and based at least on the first loss prediction, to transmit the first packet via the communication network.

Description

CONGESTION CONTROL IN NETWORK COMMUNICATIONS

RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 62/654,023, filed on April 6, 2018, and entitled “TCP Congestion Control Based on Packet Loss Prediction, ” and claims the benefit of U.S. Provisional Application No. 62/810,134, filed on February 25, 2019, and entitled “Congestion Control in Network Communications, ” which application is incorporated herein by reference.

TECHNICAL FIELD

This disclosure relates generally to network communication and, in particular embodiments, to congestion control in network communications.

BACKGROUND

Network communication protocols define the way in which electronic devices communicate with one another. Transmission Control Protocol (TCP) , which is an example network communication protocol, is a connection-oriented, reliable, and byte stream-based transport layer communication protocol. TCP, for example, works with Internet Protocol (IP) networks.

An issue that might arise in communication networks, such as TCP/IP networks, is congestion control. Congestion might occur when the number of data packets sent into a network by users exceeds the processing capabilities of the network. Some network communication protocols, such as TCP, employ techniques to attempt to perform congestion control. A congestion control protocol is a scheme that, at least in part, decides when to send new data packets, retransmit data packets, and send acknowledgements (ACKs) . Congestion control algorithms might facilitate alieving congestion and promoting the efficiency and fairness of network use. The general functionality of the TCP congestion control protocols include resource tracking, congestion avoidance, and loss recovery. Congestion signals might be derived from ACKs from the receivers of packets and might indicate packet loss and or longer round trip delays.

SUMMARY

According to a first aspect of this disclosure, a method for Transmission Control Protocol (TCP) congestion control includes determining, by a communication device and based on a first signal received from a communication network, a first network state. The method further includes determining, by a loss predictor of the communication device and according to the first network state, a first loss prediction for a first packet. The first loss prediction indicates a likelihood that the first packet will be lost due to network congestion if the first packet is transmitted via the communication network. The method further includes determining, by the communication device and based at least on the first loss prediction, to transmit the first packet via the communication network.

Optionally, in any of the preceding aspects of the first aspect, the method further includes determining, based on the first loss prediction, whether to increase or decrease a value of a congestion window variable. The value of the congestion window variable indicates a size of a congestion window.

Optionally, in any of the preceding aspects of the first aspect, the first loss prediction includes a first loss probability indicating an estimate of a probability that the first packet will be lost if the first packet is transmitted via the communication network.

Optionally, in any of the preceding aspects of the first aspect, determining, by the communication device and based at least on the first loss prediction, to transmit the first packet via the communication network includes determining that the first loss probability does not exceed a loss probability threshold and determining, based at least on determining that the first loss probability does not exceed the loss probability threshold, to transmit the first packet via the communication network.

Optionally, in any of the preceding aspects of the first aspect, the method further includes transmitting, in response to determining to transmit the first packet via the communication network, the first packet via the communication network.

Optionally, in any of the preceding aspects of the first aspect, the method further includes determining, by the communication device and based on a second signal received from the communication network, a second network state; determining, by the communication device and according to the second network state, a second loss prediction for a second packet, the second loss prediction indicating a likelihood that the second packet will be lost if the second packet is transmitted via the communication network; and determining, by the communication device and based at least on the second loss prediction, not to transmit the second packet via the communication network.

Optionally, in any of the preceding aspects of the first aspect, the second loss prediction includes a second loss probability indicating an estimate of a probability that the second packet will be lost if the second packet is transmitted via the communication network. Additionally, determining, by the communication device and based at least on the second loss prediction, not to transmit the second packet via the communication network includes determining that the second loss probability exceeds a loss probability threshold; and determining, based at least on determining that the second loss probability exceeds the loss probability threshold, not to transmit the second packet via the communication network.

Optionally, in any of the preceding aspects of the first aspect, the method further includes decreasing, based at least on determining that the second loss probability exceeds the loss probability threshold, a value of a congestion window variable, the value of the congestion window variable indicating a size of a congestion window.

Optionally, in any of the preceding aspects of the first aspect, the method further includes performing supervised training of the loss predictor.

Optionally, in any of the preceding aspects of the first aspect, performing supervised training of the loss predictor includes collecting, through simulations performed on a network simulator, training data for training the loss predictor; training a model according to the training data; and correlating network states to respective probabilities of loss according to the trained model.

Optionally, in any of the preceding aspects of the first aspect, the first signal is an acknowledgement (ACK) received by the communication device in response to a previous packet transmission by the communication device.

Optionally, in any of the preceding aspects of the first aspect, the first network state is represented as a state vector representing the congested condition of the network.

Optionally, in any of the preceding aspects of the first aspect, the state vector includes values for an exponentially weighted moving average (EWMA) of acknowledgement (ACK) inter-arrival time, the first signal being an ACK; an EWMA of packet inter-sending time; a ratio of current round-trip time (RTT) and a minimum RTT; a slow start threshold; a congestion window size; or a combination of the EWMA of ACK inter-arrival time, the EWMA of packet inter-sending time, the ratio of the current RTT and the minimum RTT, the slow start threshold, and the congestion window size.

According to a second aspect of this disclosure, a system for Transmission Control Protocol (TCP) congestion control includes a non-transitory memory storage including instructions and one or more processors in communication with the memory storage. The one or more processors are configured to execute the instructions to perform the method in any of the preceding aspects of the first aspect.

According to a third aspect of this disclosure, a non-transitory computer-readable media stores computer instructions for Transmission Control Protocol (TCP) congestion control, that when executed by one or more processors, cause the one or more processors to perform the method in any of the preceding aspects of the first aspect.

According to a fourth aspect of this disclosure, a system for Transmission Control Protocol (TCP) congestion control includes means for determining, by a communication device and based on a first signal received from a communication network, a first network state. The system further includes means for determining, by a loss predictor of the communication device and according to the first network state, a first loss prediction for a first packet. The first loss prediction indicates a likelihood that the first packet will be lost due to network congestion if the first packet is transmitted via the communication network. The system further includes means for determining, by the communication device and based at least on the first loss prediction, to transmit the first packet via the communication network.

According to a fifth aspect of this disclosure, a method for Transmission Control Protocol (TCP) congestion control includes determining, by a communication device and based on a first utility value and a second utility value, a reward for a first action. The first utility value is determined using a utility function and corresponds to a first time period from a first time to a second time. The second utility value is determined using the utility function and corresponds to a second time period from the second time to a third time. The first action corresponds to the first time and is one of a plurality of actions. Each action of the plurality of actions includes modifying a value of a congestion window variable for a communication network by a respective amount. The value of the congestion window variable represents a size of the congestion window. The method further includes updating, by the communication device, a first value function indicating a first desirability value associated with a first network state and the first action. The first value function is updated according to the reward and a second value function indicating a second desirability value associated with a second network state and a second action of the plurality of actions. The first network state corresponds to the first time, and the second network state and the second action correspond to the second time. The method further includes determining, by the communication device and according to the updated first value function, a third action of the plurality of actions. The method further includes updating the congestion window variable based on the third action.

Optionally, in any of the preceding aspects of the fifth aspect, a difference between the second time and the first time is at least a round trip time of a first packet in the communication network, and a difference between the third time and the second time is at least a round trip time of a second packet in the communication network.

Optionally, in any of the preceding aspects of the fifth aspect, determining, by the communication device and based on the first utility value and the second utility value, the reward is performed based at least in part on determining that a difference between the third time and the second time is greater than or equal to a round trip time.

Optionally, in any of the preceding aspects of the fifth aspect, the method further includes updating the congestion window variable at least twice prior to updating the congestion window variable based on the third action.

Optionally, in any of the preceding aspects of the fifth aspect, one of the following is true: the first action is the same as the second action; the first action is the same as the third action; the second action is the same as the third action; and the first action, the second action, and the third action are the same.

Optionally, in any of the preceding aspects of the fifth aspect, one of the following is true: the first action is different than the second action; the first action is different than the third action; the second action is different than the third action; and the first action, the second action, and the third action are different.

Optionally, in any of the preceding aspects of the fifth aspect, the first, second, and third network states are each represented as respective state vectors representing a congested condition of the communication network.

Optionally, in any of the preceding aspects of the fifth aspect, each state vector comprises values for an exponentially weighted moving average (EWMA) of acknowledgement (ACK) inter-arrival time; an EWMA of packet inter-sending time; a ratio of current round-trip time (RTT) and a minimum RTT; a slow start threshold; a current value of the congestion window variable; or a combination of the EWMA of ACK inter-arrival time, the EWMA of packet inter-sending time, the ratio of the current RTT and the minimum RTT, the slow start threshold, and the current value of the congestion window variable.

Optionally, in any of the preceding aspects of the fifth aspect, the second utility value, as determined using the utility function, is a function of throughput, delay, and packet loss rate.

Optionally, in any of the preceding aspects of the fifth aspect, the utility function is:

wherein tp is throughput, B is a bandwidth of a bottleneck in the communication network, d is a delay calculated as a difference between a current round-trip time (RTT) and a minimum RTT, p is a packet loss rate, and δ ₁ and δ ₂ are adjustable coefficients.

Optionally, in any of the preceding aspects of the fifth aspect, for an action of the plurality of actions, the respective amount for modifying the congestion window variable is -1, 0, +1, or +3.

Optionally, in any of the preceding aspects of the fifth aspect, the value function is a Q-function that is trained using a State-Action-Reward-State-Action (SARSA) temporal difference learning algorithm.

Optionally, in any of the preceding aspects of the fifth aspect, the first value function is updated as follows:

wherein

means y ₁ = (1-α) y ₁+αy ₂; s _i, a _i, r _i are state, action, and reward variables for storing respective values computed at the beginning of a time period i; α _t is the learning rate as a function of time t (e.g., in seconds) ; and γ is a discount factor; and wherein n-1 is the first time, n is the second time; and n+1 is the third time.

Optionally, in any of the preceding aspects of the fifth aspect, the following are true:

r _n+1 is computed as follows:

Δ _n+1=U _n+1-U _n; and

U _n+1 is the second utility value and U _n is the first utility value.

Optionally, in any of the preceding aspects of the fifth aspect, the method further includes determining, at the first time, the first network state; and determining, at the second time, the second network state.

Optionally, in any of the preceding aspects of the fifth aspect, the method further includes determining, in response to receiving a transmission signal, a third network state. The signal is an acknowledgement (ACK) received by the communication device in response to a previous packet transmission by the communication device.

Optionally, in any of the preceding aspects of the fifth aspect, the third action is further determined according to an ∈-greedy exploration-exploitation (E2) scheme.

According to a sixth aspect of this disclosure, a system for Transmission Control Protocol (TCP) congestion control includes a non-transitory memory storage including instructions and one or more processors in communication with the memory storage. The one or more processors are configured to execute the instructions to perform the method in any of the preceding aspects of the fifth aspect.

According to a seventh aspect of this disclosure, a non-transitory computer-readable media stores computer instructions for Transmission Control Protocol (TCP) congestion control, that when executed by one or more processors, cause the one or more processors to perform the method in any of the preceding aspects of the fifth aspect.

According to a fourth aspect of this disclosure, a system for Transmission Control Protocol (TCP) congestion control includes means for determining, by a communication device and based on a first utility value and a second utility value, a reward for a first action. The first utility value is determined using a utility function and corresponds to a first time period from a first time to a second time. The second utility value is determined using the utility function and corresponds to a second time period from the second time to a third time. The first action corresponds to the first time and is one of a plurality of actions. Each action of the plurality of actions includes modifying a value of a congestion window variable for a communication network by a respective amount. The value of the congestion window variable represents a size of the congestion window. The system further includes means for updating, by the communication device, a first value function indicating a first desirability value associated with a first network state and the first action. The first value function is updated according to the reward and a second value function indicating a second desirability value associated with a second network state and a second action of the plurality of actions. The first network state corresponds to the first time, and the second network state and the second action correspond to the second time. The system further includes means for determining, by the communication device and according to the updated first value function, a third action of the plurality of actions. The system further includes means updating the congestion window variable based on the third action.

In certain embodiments, a loss predictor-based TCP (LP-TCP) congestion control mechanism predicts and reduces packet loss events, lowers the frequency of sending rate reduction, and strives for an improved throughput relative to conventional congestion control techniques. In certain embodiments, a loss predictor-based TCP congestion control mechanism works particularly well when the network model remains more or less fixed, although this disclosure contemplates the loss predictor-based TCP congestion control mechanism performing well in any suitable environment.

In certain embodiments, a reinforcement learning-based TCP (RL-TCP) congestion control mechanism has an objective to improve a function of throughput, delay, and packet loss rate. In certain embodiments, a reinforcement learning-based TCP congestion control mechanism provides a suitable tradeoff between throughput and delay. In certain embodiments, a reinforcement learning-based TCP congestion control mechanism works particularly well in environments in which the topology and or other parameters of a network change, although this disclosure contemplates the reinforcement learning-based TCP congestion control mechanism performing well in any suitable environment.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of this disclosure, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

FIGURE 1 illustrates an example system for learning-based TCP congestion control, according to certain embodiments of this disclosure;

FIGURE 2 is a graph showing an F1 score of the LP-TCP congestion control scheme as a function of the decision threshold th, for various network setups, according to certain embodiments of this disclosure;

FIGURE 3 illustrates an example method for LP-TCP congestion control, according to certain embodiments of this disclosure;

FIGURE 4 illustrates an example method for a training learner engine of a communication device for LP-TCP congestion control, according to certain embodiments of this disclosure;

FIGURE 5 illustrates an example method for RL-TCP congestion control, according to certain embodiments of this disclosure;

FIGURE 6 illustrates an example dumbbell topology, according to certain embodiments of this disclosure;

FIGURE 7 illustrates the changes of congestion window size for the Q-learning based TCP (Q-TCP) , Q-TCP with credit assignment (Q-TCPca) , RL-TCP with no credit assignment (RL-TCPno-ca) , and RL-TCP congestion control schemes during one simulation, according to certain embodiments of this disclosure;

FIGURE 8 is a block diagram of an embodiment processing system for performing methods described herein, which may be installed in a host device, according to certain embodiments of this disclosure; and

FIGURE 9 is a block diagram of a transceiver adapted to transmit and receive signaling over a telecommunication network, according to certain embodiments of this disclosure.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

As described above, in a network, such as a TCP/IP network, one mechanism for promoting efficient and fair sharing of network resources among network users is a congestion control scheme, such as a TCP congestion control scheme. Some TCP congestion control techniques hard-wire predefined actions on a congestion window (cwnd) to specific feedback signals such as packet loss and/or round-trip time (RTT) , based on assumptions about the network. Reactions to signals indicating congestion might include increasing or decreasing the cwnd size. For example, loss-based TCPs might reduce the cwnd by half and retransmit a lost packet on three duplicated ACKs. As networks become more complex, however, it becomes more difficult to determine the optimal feedback/action mapping.

Some TCP congestion control techniques, such as NewReno and Vegas, were designed for the Ethernet and later suffered from issues such as “high-bandwidth” and “lossy-link” as networks evolved. Other TCP congestion control schemes include CUBIC and Compound. CUBIC increases its cwnd according to a cubic function, which may facilitate recovery of cwnd after cwnd is decreased in response to a loss event and expansion of cwnd when an increase in network ceiling is detected. Compound reacts to signal delay in addition to packet loss events, and adopts a scalable cwnd increasing rule in response to changes in round-trip time. Adaptive congestion control was designed specifically for unpredictable cellular networks, and builds a delay profile and uses delay measurements to react to changes of capacity in a cellular network.

Embodiments of this disclosure provide techniques for learning-based congestion control. For example, embodiments of this disclosure provide a loss predictor-based TCP congestion control mechanism. As another example, embodiments of this disclosure provide a reinforcement learning-based TCP congestion control mechanism. Throughout this description and accompanying figures, congestion control may be abbreviated as CC, the loss predictor-based TCP congestion control mechanism may be abbreviated as LP-TCP congestion control (or LP-TCP CC) mechanism, and the reinforcement learning-based TCP congestion control mechanism may be abbreviated as RL-TCP congestion control (or RL-TCP CC) mechanism.

In the context of TCP congestion control, machine learning has been used to classify congested and non-congested loss, and it has also been used to purportedly improve round-trip time estimation. One proposed congestion control protocol formalized the multi-user congestion control problem as a partially observable Markov decision process (POMDP) and learned for the optimum policy offline, which might involve intense offline computation and the performance of which might closely depend on the accuracy of the network and traffic models. Another proposed congestion control protocol adaptively adjusts the sending rate based on continuously executed “micro-experiments, ” but this proposal is entirely rate-based and its performance closely depends on the accuracy of the clocking. Another proposed congestion control protocol attempts to use Q-learning to design the TCP congestion control protocol, but the choices of reward, exploitation-exploration (E2) schemes, and states are relatively basic, and it has demonstrated results only for a single sender or two senders.

One problem with hard-wired and network model specific TCP congestion control designs is that such designs might need to be refined or even redesigned should new network technologies arise, which is likely to be the case with modern computer networks. Another problem is that such designs are generally mechanism-driven instead of objective-driven. In addition, loss-based TCP congestion control protocols, such as TCP Tahoe, Reno, NewReno, CUBIC, and Compound, reduce the cwnd when a packet is lost and slowly find a new network ceiling. This process might incur bandwidth underutilization. Some learning-based TCPs find the optimal changes to the cwnd based on offline optimization, which bears a high complexity cost. Q-learning TCP might take a long online training period in order to learn the appropriate Q function.

To address none, some, or all of the above problems, embodiments of this disclosure provide learning-based TCP congestion control protocols for wired networks using supervised and reinforcement learning.

Based on the observations that packet loss causes a decrease in cwnd and reduces the throughput, certain embodiments provide a loss predictor-based TCP congestion control that aims to minimize the packet loss rate for each sender. Supervised learning may be used to train a packet LP from interactions with the network and may be used to predict whether packet loss will happen before sending each packet. If loss is predicted accurately, reducing cwnd may be avoided and thus flow throughput may be improved.

In certain embodiments, a method for TCP congestion control, which incorporates an LP-TCP CC mechanism, in a communication network includes determining, by a communication device and based on a signal received from a network, a network state. The method further includes determining, by a loss predictor of the communication device and according to the network state, a loss prediction for a packet. The loss prediction indicates a likelihood that the packet will be lost due to network congestion if the packet is transmitted via the network. The method further includes determining, by the communication device and based at least on the loss prediction, to transmit the packet via the network.

Prediction of packet loss may allow congestion to be predicted and proactive avoidance of congestion. That is, embodiments of the disclosed TCP congestion control protocol is based on the idea of “predicting congestion. ” Learning from experience, the disclosed TCP agent evaluates the “probability of loss” before sending the current packets. The result is a TCP congestion control agent that avoids running over a congestion cliff as much as possible, improves bottleneck throughput, and reduces the times of fast recovery.

Embodiments of the disclosed TCP congestion control protocols might be implemented as modified versions of TCP NewReno in a Network Simulator 2 (NS2) , and the throughput, delay, and packet loss rate (PLR) of embodiments of the disclosed TCP congestion control protocols can be compared with those of TCP NewReno on a simulated wired network. With proper training, RL-TCP congestion control shows superior performance compared to TCP NewReno, while LP-TCP outperforms NewReno on average throughput and PLR.

In certain embodiments, a loss predictor-based TCP congestion control mechanism predicts and reduces packet loss events, lowers the frequency of sending rate reduction, and strives for an improved throughput relative to conventional congestion control techniques. In certain embodiments, a loss predictor-based TCP congestion control mechanism works particularly well when the network model remains more or less fixed, although this disclosure contemplates the loss predictor-based TCP congestion control mechanism performing well in any suitable environment.

When the topology and parameters of a network change, a new packet LP might need to be trained again. Therefore, another embodiment provides a design for a reinforcement learning-based TCP congestion control with carefully designed rewards, E2 schemes, and states, to optimize the average throughput delay ratio of a sender. In these embodiments, packet loss may be treated as a signal for congestion.

In certain embodiments, a method for TCP congestion control, which may incorporate an RL-TCP CC mechanism, in a communication network includes determining, by a communication device and based on a first utility value and a second utility value, a reward for a first action. The first utility value is determined using a utility function and corresponds to a first time period from a first time to a second time. The second utility value is determined using the utility function and corresponds to a second time period from the second time to a third time. The first action corresponds to the first time and is one of a plurality of actions, each action of the plurality of actions comprising modifying a value of a congestion window variable for a communication network by a respective amount, the value of the congestion window variable representing a size of the congestion window. The method further includes updating, by the communication device, a first value function indicating a first desirability value associated with a first network state and the first action. The first value function is updated according to the reward and a second value function indicating a second desirability value associated with a second network state and a second action of the plurality of actions. The first network state corresponds to the first time, and the second network state and the second action corresponding to the second time. The method further includes determining, by the communication device and according to the updated first value function, a third action of the plurality of actions. The method further includes updating the congestion window variable based on the third action.

In certain embodiments, a reinforcement learning-based TCP congestion control mechanism has an objective to improve a function of throughput, delay, and packet loss rate. In certain embodiments, a reinforcement learning-based TCP congestion control mechanism provides a suitable tradeoff between throughput and delay. In certain embodiments, a reinforcement learning-based TCP congestion control mechanism works particularly well in environments in which the topology and or other parameters of a network change, although this disclosure contemplates the reinforcement learning-based TCP congestion control mechanism performing well in any suitable environment.

The structure, manufacture, and use of certain embodiments are described in detail below. It should be appreciated, however, that this disclosure provides many applicable novel concepts that can be embodied in a wide variety of specific contexts. The specific embodiments described are merely illustrative of specific ways to make and use the embodiments, and do not limit the scope of the disclosure.

For example, the techniques of this disclosure will be described in the context of the transmission control protocol (TCP) . However, it should be noted that the techniques of this disclosure can be applied to different variations and flavors of TCP as well as to alternatives to TCP and other network protocols that have a congestion control component. In the following description, numerous specific details are set forth in order to provide a thorough understanding of this disclosure. This disclosure may be practiced without some or all of these specific details. In other instances, well known process operations have not been described in detail in order not to unnecessarily obscure this disclosure.

FIGURE 1 illustrates an example system 100 for learning-based TCP congestion control, according to certain embodiments of this disclosure. System 100 includes a communication device 102 and a network 104.

Communication device 102 may be (or may be part of) any processing device that is able to send communications over a communication network. As particular examples, communication device 102 may be (or may be part of) an end user device, such as a landline, a mobile telephone, a smartphone, a desktop or laptop computer, a tablet computer, a television, or any other suitable type of electronic end user device. As yet further examples, communication device 102 may be (or may be part of) network equipment, such as a switch, a router, a gateway, a base station, or any other suitable type of electronic network equipment. In certain embodiments, all or a portion of communication device 100 may be referred to as a TCP CC agent or simply an agent.

In the illustrated example, communication device 102 includes a processor 106, a memory 108, a sensing engine 110, a learner engine 112, an actuator engine 114, and a sending engine 116. Each of these example components are described below.

Processor 106 includes any combination of hardware, firmware, and software that operates to control and process information. Processor 106 may be a programmable logic device, a central processing unit, a microcontroller, a microprocessor, a digital signal processor, a field programmable gate array, an application specific integrated circuit, any processing device, or any combination of the preceding. Processor 106 may be configured to read and process instructions stored in memory 108. Although illustrated as a single functional unit, this disclosure contemplates mobile device including any suitable number of processors.

Memory 108 stores, either permanently or temporarily, data, operational instructions (e.g., software) , or other information for access and/or execution by processor 106. Memory 108 includes any one or a combination of volatile or non-volatile local or remote devices for storing information. For example, memory 108 may include static or dynamic random access memory (RAM) , read-only memory (ROM) , magnetic storage devices, optical storage devices, hard disks, subscriber identity module (SIM) cards, memory sticks, secure digital (SD) memory cards, or any other information storage device or a combination of these devices. In certain embodiments, at least a portion of memory 108 is non-transitory. Although a single memory 108 is illustrated, communication device 102 may include any number of memories 108. Among other potential information, memory 108 stores programming for execution by the processor 106 to cause processor 106 to perform operations associated with communication device 102.

Communication device 102 may communicate with other devices (e.g., other communication devices 102) via network 104. Network 104 facilitates wireless or wireline communication. Network 104 may communicate, for example, Internet protocol (IP) packets, Frame Relay frames, Asynchronous Transfer Mode (ATM) cells, voice, video, data, and other suitable information between network addresses. Network 104 may include one or more local area networks (LANs) , radio access networks (RANs) , metropolitan area networks (MANs) , wide area networks (WANs) , mobile networks (e.g., using BLUETOOTH, WiMax (802.16) , Wi-Fi (802.11) , 3G, 4G, 5G, Long-Term Evolution (LTE) , 5G New Radio (NR) , or any other suitable wireless technologies in any suitable combination) , all or a portion of the global computer network known as the Internet, and/or any other communication system or systems at one or more locations, any of which may be any suitable combination of wireless and wireline.

Sensing engine 110, learner engine 112, actuator engine 114, and sending engine 116 will now be described. Sensing engine 110, learner engine 112, actuator engine 114, and sending engine 116 may be implemented using any suitable combination of hardware, firmware, and software. In certain embodiments, a portion or all of sensing engine 110, learner engine 112, actuator engine 114, and sending engine 116 are implemented in software that is stored in memory 108 (or another suitable memory location accessible to communication device 102) for execution by processor 106 to implement the respective operations associated with each of these engines. Although described separately, sensing engine 110, learner engine 112, actuator engine 114, and sending engine 116 may be combined physically or functionally in any suitable manner.

In general, sensing engine 110 senses and collects signals from network 104, processes the signals, and outputs an array of values representing the current network states. Learner engine 112, which may include an online learning algorithm or a learned model, takes in the current network states and outputs certain “knowledge” based on the network states. Actuator engine 114 decides what actions to take based on the “knowledge” output from learner engine 112. Sending engine 116 sends packets based on the orders given by actuator engine 114.

Sensing engine 110 receives signals 118 from network 104. In certain embodiments, a signal 118 is and acknowledgement (ACK) received in response to a prior transmission of a packet 120 by communication device 102. For purposes of this description, it will be assumed that signals 118 and packets 120 received by or communicated by, respectively, communication device 102 are TCP signals. As described above, however, this disclosure contemplates other types of signals being communicated or received, if appropriate. Throughout this description, signals 118 may be referred to as acknowledgements or ACKs 118. Furthermore, throughout this description, signals 118 and packets 120 may be referred to in singular or plural form, depending on context.

In certain embodiments, a signal 118 is associated with a round-trip time. For example, signal 118 may include a round-trip time. Additionally or alternatively, sensing engine 110 may determine the round-trip time associated with a signal 118 based on information included in signal 118. Sensing engine 110 may compute statistics that reflect the congestion level of network 104. Such statistics may include the packet inter-sending time, ACK inter-arrival time, and round-trip times. In certain embodiments, a round-trip time associated with an ACK is the difference between the time a particular packet 120 is transmitted from communication device 102 (e.g., according to a clock accessible to communication device 102) and the time an ACK (e.g., signal 118) that was sent by another device in response to the particular packet 120 is received by device 102 (e.g., according to a clock accessible to communication device 102) . In certain embodiments, the ACK inter-arrival time, or more generically the signal inter-arrival time, refers to the time gap between receipt of signals 118 (e.g., ACKs) by communication device 102. In certain embodiments, the packet inter-sending time refers to the time gap between sending of packets 120 by communication device 102. Sensing engine 110 may compute other quantities, where appropriate.

Sensing engine 110 may process signals 118, combine signals 118 with variables maintained by communication device 102, and output a representation (e.g., an array or a vector) of the current network state, which may be intended to reflect the current congested state of network 104. The network state may be represented using a suitable data structure. In certain embodiments, the network state is represented as a feature array representing a congested condition of network 104 when a new signal 118 (e.g., a new ACK) is received. For example, the feature array may be a length-55 feature array, although this disclosure contemplates the feature array having any suitable length. In certain embodiments, the network state is represented as a state vector representing a congested condition of network 104 when a new signal 118 (e.g., a new ACK) is received. For example, the state vector may be a length-5 state vector, although this disclosure contemplates the state vector having any suitable length. The data structure representing the state (e.g., the state array or vector) may include any values.

Learner engine 112 serves as the “brain” of communication device 102 (for congestion control purposes) , learning the complex relationship between a certain state and possible actions. Depending on current conditions (e.g., the current network state and/or other information) reported by sensing engine 110, learner engine 112 may prompt actuator engine 114 to act appropriately. Proper design and training of learner engine 112 may facilitate a well-performing learning-based congestion control scheme. In a training stage, the learner engine 112 learns from interacting with the environment in order to optimize a goal. In a testing stage, the learner engine 112 applies the knowledge it learns and may or may not update its knowledge depending on the specific design. In certain embodiments, learner engine 112 includes an online learner engine or learned model. Learner engine 112 may take the current network state (e.g., as determined by sensing engine 110) and output a prediction that is based on the current network state.

Actuator engine 114 may act based on the prediction from learner engine 112. For example, actuator engine 114 may receive information from learner engine 112 and determine an appropriate action based on the received information. As another example, the information actuator engine 114 receives from learner engine 112 may identify a particular action, and actuator engine 114 may cause another element of communication device 102 (e.g., sending engine 116) to execute that action. The actions output by actuator engine 114 can be changes to the congestion window and/or to the sending rate, for example.

Sending engine 116 may act according to an instruction from actuator engine 114. For example, if actuator engine 114 instructs sending engine 116 to send a packet 120, sending engine 120 may send a packet 120.

A packet 120 may include any suitable type of data item that may be communicated via network 104. In certain embodiments, packets 120 are IP packets that are communicated according to the TCP.

As described above, system 100 may implement an LP-TCP congestion control mechanism, an RL-TCP congestion control mechanism, or both. Each of these congestion control algorithms are described in greater detail below with reference to system 100. In certain embodiments, the learning-based TCP congestion control schemes are based on NewReno, which means that slow start, fast retransmit, and fast recovery aspects of NewReno may be adopted. It is noted, however, that this disclosure is not limited to the use of NewReno, as other suitable congestion control schemes may be used to implement the disclosed LP-TCP and RL-TCP congestion control mechanisms.

LP-TCP CC

LP-TCP congestion control is based on supervised learning to facilitate congestion avoidance. TCP NewReno treats packet loss as a signal of network congestion. Each time a packet loss occurs, NewReno reduces the sending rate (e.g., by halving the congestion window size) and, as a result, the throughput may be reduced. In some scenarios, this flow may under-utilize the bottleneck bandwidth in an under-buffered network (e.g., network 104) . In certain embodiments, the LP-TCP congestion control design attempts to minimize the packet loss rate for each sender (communication device 102) . LP-TCP congestion control may predict and reduce packet loss events, lower the frequency of sending rate reduction, and strive for a better throughput. Supervised learning may be used to build a loss predictor in communication device 102 to predict whether congested packet loss will happen if a packet (e.g., packet 120) is sent. If congested loss exists and can be predicted, then congested loss might be avoided and flow throughput might be improved.

For LP-TCP congestion control, learner engine 112 may predict, based on a current network state received from sensing engine 110, packet loss and informs actuator engine 114 how likely a packet (e.g., packet 120) will be lost if sent via network 104. In certain embodiments, if a probability of the packet (e.g., packet 120) being lost is higher than a threshold, then actuator engine 114 does not cause sending engine 116 to send the packet (e.g., packet 120) . In such a scenario, actuator engine 114 may reduce the congestion window size by a suitable amount (e.g., by one) , if appropriate. In certain embodiments, if the probability of the packet (e.g., packet 120) being lost is less than or equal to the threshold, then actuator engine 114 causes sending engine 116 to send the packet (e.g., packet 120) . In contrast to this example, this disclosure contemplates the case of the probability being equal to the threshold being treated similarly to the probability exceeding the threshold, if desired for a particular implementation.

Sensing engine 110 may receive signals 118 as inputs. Signals 118 may be ACKs for packets 120 previously sent by communication device 102. Sensing engine 110 may determine, in response to receiving a signal 118 (e.g., an ACK) a current network state of network 104. The current network state determined by sensing engine 110 may be intended to reflect the current congestion state of network 104. Sensing engine 110 may determine the current network state based on signal 118 (e.g., the received ACK) and/or one or more other variables determined by and/or maintained by communication device 102.

Sensing engine 110 may capture the current network state of network 104 in any suitable manner, such as using a suitable type of data structure. As just one example, sensing engine 110 may represent the current network state of network 104 as a feature array, such as a length-55 features array) . In certain embodiments, the current network state of network 104 may be captured using any suitable combination of the following values: the current congestion window size, the order of the current packet (e.g., the packet 120 to be sent) in the congestion window, the exponentially weighted moving average (EWMA) of ACK inter-arrival time, time series of ACK inter-arrival time, minimum of ACK inter-arrival time, EWMA of packet inter-sending time, time series of packet inter-sending time, minimum of packet inter-sending time, time series of round-trip time, minimum of round-trip time, time series of ratios of ACK inter-arrival time, time series of ratios of packet inter-sending time, time series of ratios of round-trip times, a slow start threshold, and any other suitable variables. In certain embodiments, a time series of a variable includes the eight most recent samples of that variable, although this disclosure contemplates using any suitable number of samples for calculating the time series of a variable. In certain embodiments, the packet inter-sending time (and thus the features that depend on it) is computed based on the packets 120 communication device 102 is sending instead of from the time stamps of received signals 118 (e.g., ACKs) . Sensing engine 110 may provide the determined current network state of network 104 to learner engine 112.

Learner engine 112 may receive the current network state of network 104 from sensing engine 110. For example, learner engine 112 may receive the feature array determined by sensing engine 110. Learner engine 112 may determine, based on the current network state, an estimate of the probability of loss if a next packet 120 is transmitted via network 104. In certain embodiments, the determined loss probability reflects a likelihood that a next packet 120 will be lost if the next packet 120 is transmitted via network 104. As a particular example, the determined loss probability may be a percentage value, with a higher percentage reflecting a greater likelihood that the next packet 120 will be lost if the next packet 120 is transmitted via network 104. In certain embodiments, to determine the probability of loss if a next packet 120 is transmitted via network 104, learner engine 112 evaluates the current network state of network 104 based on the training learner engine 112 has received. Examples of this training are described in greater detail below. The result of the training may be that possible states of network 104 are mapped to particular probabilities of loss such that learner engine 112 may determine a probability of loss based on the current network state of network 104. Learner engine 112 may provide the determined loss probability to actuator engine 114.

Actuator engine 114 may receive the determined loss probability from learner engine 112. Actuator engine 114 may determine, based on the loss probability, an appropriate action. In certain embodiments, if a probability of the packet (e.g., packet 120) being lost is higher than a threshold, then actuator engine 114 does not cause sending engine 116 to send the packet (e.g., packet 120) . In this scenario, actuator engine 114 may reduce the congestion window size by a suitable amount (e.g., by one) , if appropriate. In certain embodiments, if the probability of the packet (e.g., packet 120) being lost is less than or equal to the threshold, then actuator engine 114 causes sending engine 116 to send the packet (e.g., packet 120) . In contrast to this example, this disclosure contemplates the case of the probability being equal to the threshold being treated similarly to the probability exceeding the threshold, if desired for a particular implementation. The output from the actuator engine 114 is whether to send the packet 120 or not to send the packet 120.

Additionally or alternatively, actuator engine 114 may determine whether to adjust the congestion window size. For example, if actuator engine 114 determines to send packet 120, actuator engine may cause the congestion window size to increase by 1/W. Although increasing the congestion window size by a particular amount (1/W) is primarily described, this disclose contemplates increasing the congestion window size by another amount or leaving the congestion window the same size if actuator engine 114 determines to send packet 120. As another example, if actuator engine 114 determines not to send packet 120, actuator engine may cause the congestion window size to decrease by one. Actuator engine 114 may provide the decision of whether or not to send packet 120 to sending engine 116.

Sending engine 116 receives the decision from actuator engine 114 and acts based on that determination. For example, if the decision from actuator engine 114 indicates that packet 120 should be transmitted via network 104, then sending engine 116 causes packet 120 to be transmitted via network 104. As another example, if the decision from actuator engine 114 indicates that packet 120 should not be transmitted via network 104, then sending engine 116 does not cause packet 120 to be transmitted via network 104.

In the LP-TCP congestion control mechanism, learner engine 112 is a loss predictor. For example, learner 112 receives the current network state of network 104 as an input and predicts a probability of a to-be-sent packet 120 being lost due to congestion should the to-be-sent packet 120 be sent. Learner 112 is built using a supervised learning technique. For example, the supervised learning technique may be the random forest learning technique. Although this disclosure focuses on an embodiment in which the supervised learning technique is the random forest learning technique, this disclosure contemplates training learner engine 112 using any suitable supervised learning technique. For learning, the LP-TCP congestion control-capable communication device 102 works in two phases, training and testing.

During the training phase, learner engine 112 of communication device 102 is trained using training data. The training phase consists of two steps, a training set generation step and a training step. Each of these steps is described below.

First, the sensing engine 110 generates a training set. In certain embodiments, training data to train learner engine 112 is collected through NewReno simulations run on an NS2. Each training vector (or other suitable data structure) includes sender states at the sending time of each packet and a label that corresponds to the delivery result of that packet. The sender states are updated when an ACK is received and when a packet is sent. In other words, when a packet 120 is sent (during simulation) , the current network state of network 104 is recorded as a feature vector (or other suitable data structure) just before the packet 120 is transmitted. If the packet 120 is successfully delivered (e.g., determined based on whether a corresponding ACK for this packet 120 is received) , the feature vector for the state (at the time of sending the packet 120) is assigned a corresponding label of 0; otherwise, the label is 1 (e.g., for loss) . Of course, this disclosure contemplates using different values for the labels, if appropriate. Since loss events generally occur less frequently than non-loss events, the collection of training data may be terminated when enough losses in the training data have been detected. The appropriate number of losses may be any suitable number, according to particular implementations. As just one particular example, the training data collection may last approximately 5000 seconds.

Second, a random forest model (or other appropriate supervised training model) is then trained with this training data. For example, a random forest classifier may be trained offline using the training data and installed into learner engine 112. For any feature vector (or other suitable data structure) representing a certain state, the model outputs the estimated probability of loss should a packet 120 be sent, which is the mean prediction of the decision trees in the forest. The output of the random forest classifier (learner engine 112) is soft and corresponds to the probability that a packet will be lost if sent.

Based on the time stamps recorded in returning ACKs, a length-55 feature vector (or other suitable data structure) is determined containing the current congestion window, the EWMA of the intervals between successive ACKs, the EWMA of the intervals between successive sending times, and any other suitable information. This data structure is input to learner engine 112.

During the testing phase, the sensing engine 110, the learner engine 112 (e.g., the random forest model) , and the actuator engine 114 work together as the loss prediction-based TCP congestion control mechanism. The sensing engine 110 computes current network state (e.g., the feature vector or other suitable data structure) of the same structure as in the training phase, based on the time stamps encoded in received signals 118 (e.g., ACKs) and/or other parameters. Before sending a packet, the trained model takes in the current network state (e.g., feature vector) , and outputs a probability of loss if a packet is sent. Based on a predefined threshold, the actuator engine 114 then decides whether or not to send the packet. If the actuator engine 114 decides to send the packet, the congestion window is not changed. If the actuator engine 114 decides not to send the packet, the congestion is decreased by one.

After testing and any changes resulting from the testing, the sensing engine 110, the learner engine 112 (e.g., the random forest model) , and the actuator engine 114 work together as the loss prediction-based TCP congestion control mechanism in congestion avoidance with real traffic in network 104. During congestion avoidance, when a new ACK is received by communication device 102 (e.g., by sensing engine 110) , the cwnd size expands by 1/cwnd, and sensing engine 110 updates the state. When sending engine 116 is about to send a packet 120, the state is computed again. Learner engine 112 then takes in the state vector, and outputs a probability of loss of that packet 120. If the probability of loss is lower than the pre-determined threshold th, then actuator engine 114 cause the packet 120 to be sent (e.g., by sending engine 116) . Otherwise, in certain embodiments, actuator engine 114 does not cause the packet 120 to be sent, and reduces the congestion window size by one.

The decision threshold th for comparison (e.g., by actuator engine 114) with the loss probability is a parameter of the LP-TCP congestion control mechanism, may be predetermined (e.g., prior to a collision avoidance phase) , and may be used to adjust performance of the LP-TCP congestion control mechanism. If the threshold is high, a prediction that a packet is lost is less likely (LP-TCP holds on to a packet only when it is highly confident that a packet loss will happen) , and thus the LP-TCP acts more conservatively and only reacts when there is near certainty about a decision. If the threshold is low, the LP-TCP is more likely to hold on to packets and thus avoid potential packet losses, but the opportunity to transmit when resources are available may be missed.

In addition to loss prediction accuracy, network users often care about the performance of throughput (tp) and delay (d) . Therefore, a throughput-delay tradeoff metric M _e is provided and defined as:

M _e=log (E (tp) ) -0.1 log (E (d) ) .

Directly computing the decision threshold th that maximizes M _e may be difficult. Thus, in certain embodiments, a first attempt may be performed by sampling th as 0.1, 0.3, 0.5, 0.7, 0.9. Among these samples and th ₀, a th that maximize M _e may be chosen. For various network configurations (L, K) , M _e may be computed for the threshold th candidates (see, e.g., Table 1) .

For example, Table 1, provided below, shows a throughput-delay tradeoff metric, M _e, as a function of the decision threshold th of the loss predictor at various network configurations. For each network setup (L , K) , a threshold th ₀ is computed that maximizes the F1 score of the loss predictor (LP) in Table 1. An F1 score is a measure of the accuracy of a test. The variable L means the size of the buffer at a potential network congestion point in network 104, and the variable K means the number of senders in network 104. The term (L, K) _mix means one LP-TCP and K -1 NewReno flows coexist in a bottleneck (e.g., the potential network congestion point in network 104) . The selected ths (corresponding to the bold-faced value of M _e) may be used for simulations, such as those described later in this disclosure.

Table 1

FIGURE 2 is a graph showing an F1 score of the LP-TCP congestion control scheme as a function of the decision threshold th, for various network setups, according to certain embodiments of this disclosure. For purposes of this example, the F1 score for each of the first four rows in Table 1 corresponds to a plotted line in the graph of FIGURE 2.

Returning to FIGURE 1, in operation of an example embodiment for LP-TCP congestion control, communication device 102 (e.g., sensing engine 110) receives a signal 118 from network 104. In certain embodiments, signal 118 is an ACK of a prior packet 120 communicated by communication device 102 via network 104. Communication device 102 (e.g., sensing engine 110) then determines a first network state. For example, sensing engine 110 may determine a variety of values based on the received signal 118 and other variables determined or maintained by communication device 102. In certain embodiments, sensing engine 110 captures the network state as a feature vector or other suitable data structure.

Communication device 102 (e.g., learner engine 112) next determines a first loss prediction for a first packet 120 based on the previously-determined first network state. Sensing engine 110 may communicate or otherwise make available to learner engine 112 the previously-determined first network state. The first loss prediction indicates the likelihood that the first packet 120 will be lost due to network congestion if the first packet 120 is transmitted via network 104. In certain embodiments, the loss prediction is a loss probability determined by learner engine 112. For example, the loss probability may be a percentage value, where a higher percentage indicates a higher likelihood that a packet will be lost if transmitted via network 104. Learner engine 112 may determine the loss prediction based on the trained model, such as the trained random forest model, which may map network states to particular loss predictions (e.g., loss probabilities) based on the training that learner engine 112 received during a training phase in which learner engine 112 is trained using simulated traffic.

Communication device 102 (e.g., actuator engine 114) then determines whether to send the first packet 120 based on the previously-determined loss prediction. Learner engine 112 may communicate or otherwise make available to actuator engine 114 the previously-determined loss probability. In certain embodiments, actuator engine 114 compares the loss prediction (e.g., the loss probability) for the first packet 120 to a decision threshold to determine whether to send the first packet 120. For example, actuator engine 114 may determine whether the loss probability exceeds the decision threshold.

If communication device 102 (e.g., actuator engine 114) determines not to send the first packet 120 (e.g., by determining that the loss probability for the first packet 120 exceeds the decision threshold) , then the first packet 120 is not sent and communication device 102 (e.g., actuator engine 114) may reduce the size of the congestion window by 1.

If, on the other hand, communication device 102 (e.g., actuator engine 114) determines to send the first packet 120 (e.g., by determining that the loss probability for the first packet 120 does not exceed the decision threshold) , then communication device 102 (e.g., sending engine 116) sends the first packet 120 via network 104. For example, actuator engine 114 may, in response to the decision to send the first packet 120, instruct sending engine 116 to send the first packet 120, and in response to that instruction, sending engine 116 may send the first packet 120. Communication device 102 (e.g., actuator engine 114) may then increase the size of the congestion window by 1/W. As described above, although increasing the congestion window size by a particular amount (1/W) is primarily described, this disclose contemplates increasing the congestion window size by another amount or leaving the congestion window the same size if actuator engine 114 determines to send packet 120.

Having sent packet 120 and, if appropriate, increased the size of the congestion window, communication device 102 may determine whether there is another packet 120 to send. If communication device 102 determines that there is not another packet 120 to send, then communication device 102 (e.g., sensing engine 110) may await arrival of another signal 118 (e.g., another ACK) .

If, on the other hand, communication device 102 determines that there is another packet 120 to send, then communication device 102 (e.g., sensing engine 110) determines a second network state. Communication device 102 (e.g., learner engine 112) may then determine a second loss prediction for a second packet 120 based on the second network state. The second loss prediction indicates the likelihood that the second packet 120 will be lost due to network congestion if the second packet 120 is transmitted via network 104.

Communication device 102 (e.g., actuator engine 114) determines whether to send the second packet 120 based on the second loss prediction. In certain embodiments, actuator engine 114 compares the second loss prediction (e.g., the loss probability) for the second packet 120 to the decision threshold to determine whether to send the second packet 120. For example, actuator engine 114 may determine whether the second loss probability exceeds the decision threshold.

If communication device 102 (e.g., actuator engine 114) determines not to send the second packet 120 (e.g., by determining that the second loss probability for the second packet 120 exceeds the decision threshold) , then the second packet 120 is not sent and communication device 102 (e.g., actuator engine 114) reduces the size of the congestion window by 1.

If, on the other hand, communication device 102 (e.g., actuator engine 114) determines to send the second packet 120 (e.g., by determining that the second loss probability for the second packet 120 does not exceed the decision threshold) , then communication device 102 (e.g., sending engine 116) sends the second packet 120 via network 104 awaits arrival of another signal 118 (e.g., another ACK) .

In operation of an example embodiment for training learner engine 112 of communication device 102 for LP-TCP congestion control, communication device 102 begins building training data for training learner engine 112. As described above, in certain embodiments, training data may be collected by running NewReno simulations on NS2. Communication device 102 (e.g., sensing engine 110) records a current network state of network 104. For example, prior to sending a packet 120, sensing engine 110 may determine and store a current network state of network 104. Then, communication device 102 (e.g., sending engine 116) transmits a packet 120. Communication device 102 (e.g., sensing engine 110) next determines whether an ACK for the communicated packet 120 has been received. For example, sensing engine 110 may be informed of the transmission of packet 120 and determine whether an ACK has been received within a predetermined amount of time. If an ACK for the communicated packet 120 has not been received within the predetermined amount of time, the sensing engine 110 may determine that the communicated packet 120 has been lost.

If communication device 102 (e.g., sensing engine 110) determines that an ACK has not been received for the communicated packet 120, then communication device 102 (e.g., sensing engine 110 or learner engine 112) may assign a first value to the previously-determined network state. In other words, if communication device 102 (e.g., sensing engine 110) determines that the communicated packet 120 has been lost, then communication device 102 (e.g., sensing engine 110 or learner engine 112) assigns a value indicating a lost packet to the previously-determined network state. In certain embodiments, the first value is 1, although this disclosure contemplates other values being used.

If, on the other hand, communication device 102 (e.g., sensing engine 110) determines that an ACK has been received for the communicated packet 120, then communication device 102 (e.g., sensing engine 110 or learner engine 112) may assign a second value to the previously-determined network state. In other words, if communication device 102 (e.g., sensing engine 110) determines that the communicated packet 120 was not lost, then communication device 102 (e.g., sensing engine 110 or learner engine 112) assigns a value indicating that the communicated packet 120 was not lost to the previously-determined network state. In certain embodiments, the second value is 0, although this disclosure contemplates other values being used.

Regardless of whether communication device 102 (e.g., sensing engine 110) assigns the first or second value to the previously-determined network state (e.g., regardless of whether communication device 102 determines that first packet 120 has been lost) , communication device 102 then may assign the network state and associated value to the training data. For example, communication device 102 may store the representation of the network state (e.g., the feature vector for the network state) and the associated assigned value (e.g., 1 for lost packet or 0 for not a lost packet) as part of the training data.

Communication device 102 next may determine whether enough losses have been detected. As described above, because lost packet events generally occur less frequently than non-lost packet events, the collection of training data may be terminated when enough losses in the training data have been detected. As an example, communication device 102 may track the number of lost packets that have been detected during collection of the training data and compare the current number of lost packets to a threshold number of lost packets. If the current number of lost packets meets or exceeds the threshold, then communication device 102 may determine that enough losses have been detected. If the current number of lost packets does not meet or exceed the threshold, then communication device 102 may determine that enough losses have not been detected and that collection of training data should continue.

If communication device 102 determines that enough losses have not been detected, then communication device 102 may again record the network state of network 104 in anticipation of transmitting another packet 120.

If, on the other hand, communication device 102 determines that enough losses have been detected, then communication device 102 may use the collected training data to train the classification/regression model being used to train learner engine 112. In one example, the classification/regression model used by communication device 102 is a random forest model. Communication device 102 then may apply the trained model to learner engine 112. Applying the trained model to learner engine 112 may include storing the trained model in a manner that is accessible to learner engine 112, so that learner engine 112 may evaluate future network states reported by sensing engine 110 to determine an appropriate loss prediction (e.g., loss probability) to determine for those future network states. For example, when adequately trained, learner engine 112 may be able to find a matching state (or at least a similar state) in the trained model for future network states reported by sensing engine 110.

RL-TCP CC

RL-TCP congestion control uses reinforcement learning to train learner engine 112. In general, reinforcement learning is a technique for determining the best action to take given a current state of the world (e.g., a current network state of network 104 in the context of FIGURE 1) . In some implementations of reinforcement learning, the “world” is described using Markov Decision Processes (MDPs) , where the task has some state space S, available actions A, transition function P (s, a, s') (which describes how the agent, or in this case communication device 102, will move through the state space given a current network state s and selected action a) , and reward function R (s, a) that describes the feedback the agent will receive after selecting action a in state s. The value of taking action a in state s is defined as the total reward received after selecting action a and then continuing optimally into the future.

Temporal difference (TD) learning, which is referenced below, is a technique for learning Q-values in an environment where the transition and reward functions are unknown, and instead are sampled by exploring the environment. Temporal difference learning accomplishes this sampling by taking advantage of the fact that a Q value is a prediction that can be compared against observed data. An example of temporal difference learning is the State-Action-Reward-State-Action (SARSA) technique, described in greater detail below.

As opposed to the LP-TCP embodiment, which uses supervised learning to teach a loss predictor, in certain embodiments the RL-TCP congestion control scheme finds optimum mappings between states and actions by trial-and-error interactions with a network environment during a congestion avoidance stage. This may allow the RL-TCP congestion control scheme to continuously learn and adapt to a dynamic network environment, given an objective. In this sense, the TCP congestion control challenge may be viewed as a reinforcement learning problem in which an agent (e.g., communication device 102) lacking prior knowledge learns to act by acting and receiving a reward (positive or negative) from the environment (the network environment) , with a goal of maximizing some kind of cumulative rewards. In certain embodiments, as will be described below, the RL-TCP congestion control scheme tailors the design of states and action spaces to networks with under-buffered bottleneck links. Additionally or alternatively, as will be described in greater detail below, the RL-TCP congestion control scheme may treat the temporal credit assignment of reward according to actual TCP dynamics. In certain embodiments, outside the congestion avoidance stage, practices from TCP NewReno are kept.

In certain embodiments, the goal of the RL-TCP congestion control scheme is represented using a utility function. Although certain utility functions are included throughout this disclosure, these utility functions are included for example purposes only. This disclosure contemplates using the described utility function (s) and/or any other suitable utility function (s) , according to particular implementations. One example utility function defines the goal of the RL-TCP congestion control scheme as maximizing the average throughput-to-delay ratio for each sender, e.g.,

arg min _TCP CC E _flow {log (throughput) –δlog (RTT) }

According to embodiments of the RL-TCP congestion control scheme, sensing engine 110, learner engine 112, and actuator engine 114 work together to fulfill the goal of an RL agent.

For the RL-TCP congestion control scheme, an input to sensing engine 110 is a signal 118 such as a received ACK to a previously transmitted packet 120. Inputs to learner engine 112 include a network state (e.g., determined by sensing engine 110) and a reward from the network. The network state may be represented as one or more values (e.g., stored as a feature vector) . These values may include the EWMA of the inter-arrival time between newly received ACKs (discretized into 10 intervals in one example) , the EWMA of the inter-arrival time between packets sent by sender engine 116 (discretized into 10 intervals in one example) , the ratio between the current RTT and the best RTT found so far (discretized into 10 intervals in one example) , and the slow start threshold (discretized into 10 intervals in one example) . An input to actuator engine 114 may be a value function of current network state and actions, indicating how desirable each action is at the current network state.

In response to a signal 118 (e.g., an ACK) , sensing engine 110 updates a current network state for network 104. The network state represents the congested condition of network 104. Although the network state may be represented using any suitable type of data structure, in certain embodiments, the network state is represented using a state vector, such as a length-5 state vector for example. In one example of the RL-TCP congestion control scheme, the packet inter-sending time (and thus the features that depend on it) is computed from the time-stamps in the received signals 118 (e.g., ACKs) .

At the beginning of each time step (generally one RTT) , sensing engine 110 sends the current network state (e.g., the current network state vector) to learner engine 112. Each feature in the state vector is uniformly discretized into l levels over a range where most of the values for that feature occur. In certain embodiments, l is between 10 and 20. Sensing engine 110 may also compute a reward r from the network, based on the changes of utility U at two consecutive time steps. The utility U may be a function of throughput tp, delay d (wherein d = RTT -RTT _min) , and packet loss rate p. For example, the utility U may be represented by the following utility function, which provides another example of a utility function that may be used according to certain embodiments of this disclosure:

where tp is the throughput, B is a bandwidth of a bottleneck in the network, d is the delay calculated as a difference between a current round-trip time (RTT) and a minimum RTT, p is the packet loss rate, and δ ₁ and δ ₂ are adjustable coefficients.

At the beginning of each time step, actuator engine 114 selects an action. Actuator engine 114 has an action space A containing four actions. The actions, for example, may be changes to the congestion window size. In one example, the actions of the action space are cwnd = cwnd + x, where x = -1, 0, +1, +3. In certain embodiments, choosing a relatively small x reduces the chances of incurring packet drops at an under-buffered bottleneck.

An action is actually “spread” into one round-trip time (e.g., cwnd = cwnd +x/cwnd, for each new ACK received during one RTT) . Using value-based reinforcement learning algorithms, learner engine 112 learns how “good, ” or desirable, each action a is at a state s, called the Q-function Q (s, a) in reinforcement learning, and defined as the cumulative rewards that the agent (e.g., communication device 102) receives for performing action a at state s and acting optimally subsequently.

In one example, to learn the Q-function Q (s, a) , the SARSA learning algorithm is used. SARSA is an on-policy temporal difference learning algorithm for value-based RL. At the beginning of time step n + 1, a SARSA agent (e.g., learner engine 112 of communication device 102) updates the Q-function using the following assignment (referenced herein as the unmodified Q-function assignment) :

where

means y ₁= (1-α) y ₁+αy ₂; s _i, a _i,r _i are state, action, and reward computed at the beginning of time step i; α _t is the learning rate as a function of time t (e.g., in seconds) ; and γ is the discount factor. The reward r _n+1 is computed based on the following difference:

Δ _n+1=U _n+1-U _n,

where U _i is the utility during time step i-1. However, in TCP congestion control, the effect of adjusting the congestion window at time t on the utility is observed by the sender (e.g., communication device 102) at time t + RTT. Thus, the effect of action a _n (that “spreads” into time step n) will be observed by the sender (e.g., communication device 102) during time step n + 1, and reflected in U _n+2.

As a result, using a reward r _n+2 may better capture the reward from the environment due to action a _n. Thus, in certain embodiments, at the beginning of time step n + 1, the update rule for the unmodified Q-function assignment may be modified to the following assignment (referenced herein as the modified Q-function assignment) :

where

If the true Q-function were known, at the beginning of time step n+1, actuator engine 114 could act optimally by choosing the greedy action a _n+1 = arg max _a∈A Q (s _n+1, a) . In general, however, learner engine 112 estimates the Q-function. Additionally, in a dynamic environment, the optimal policy may change. Thus, in certain embodiments, actuator engine 114 uses the ∈-greedy exploration-exploitation (E2) scheme, where actuator engine 114 chooses a greedy action with probability 1-∈, and a random action with probability ∈. In one example, ∈ is set to 0.1 by default. In certain embodiments, continual exploration facilitates maximization of utility in a dynamic network environment.

Table 2 below shows pseudocode of an example RL-TCP congestion control technique during congestion avoidance, according to certain embodiments of this disclosure. In this example, SARSA learning is used.

Table 2: RL-TCP CC (at congestion avoidance stage)

In operation of an example embodiment for RL-TCP congestion control, communication device 102 (e.g., learner engine 112) initializes the reinforcement learning parameters. For example, learner engine 112 may set values of the reinforcement learning parameters to respective initial values or sets of values. In certain embodiments, the reinforcement learning parameters include one or more action variables (a) , one or more network state variables (s) , one or more utility variables (u) , one or more time variables (t) , and one or more desirability values according to respective value functions. As a more particular example, the reinforcement learning parameters may include the following: action variables a ₀, a ₁, a ₂; network state variables s ₀, s ₁, s ₂; utility variables u ₁, u ₁; time variable (t ₀, t _c) ; and value functions Q (s ₀, a ₀) and Q (s ₁, a ₁) .

Each action variable (a) may store one or more values representing an action, each network state variable (s) may store one or more values representing a network state, each utility variable (u) may store one or more values representing a utility, each time variable (t) may store one or more values representing a time, and each value function Q (s, a) may store one or more values representing a desirability value. Although sometimes referenced in the singular, a “value” may include one or more values stored in any suitable format (e.g., as a single value, as an array of values, as a vector, etc. ) . Thus, a variable may store one or more values in any suitable format.

For purposes of this description, the following terms may be used to refer to the reinforcement learning parameter identified in parenthesis following the term: first action variable (a ₀) ; second action variable (a ₁) ; third action variable (a ₂) ; first network state variable (s ₀) ; second network state variable (s ₁) ; third network state variable (s ₂) ; first utility variable (u ₁) ; second utility variable (u ₂) ; first time variable (t ₀) ; second time, or current time, variable (t _c) ; first value function (Q (s ₀, a ₀) ) ; and second value function (Q (s ₁, a ₁) ) .

In one example, initializing the reinforcement learning parameters includes setting initial values of the reinforcement parameters as follows:

● initialize the first action variable (a ₀) , the second action variable (a ₁) , and the third action variable (a ₂) to respective values of zero;

● initialize the first network state variable (s ₀) , the second network state variable (s ₁) , and the third network state variable (s ₂) to respective values of zero vectors;

● initialize the first utility variable (u ₁) and the second utility variable (u ₂) to respective values of zero; and

● initialize the first time variable (t ₀) to respective values of zero.

Communication device 102 (e.g., sensing engine 110) may determine whether a signal 118 has been received by communication device 102 from network 104. In certain embodiments, signal 118 is an ACK of a prior packet 120 communicated by communication device 102 via network 104. For example, signal 118 may be an ACK received by communication device 102 in response to a previous packet transmission by communication device 102. Although an explicit determination is described, this disclosure contemplates communication device 102 (e.g., sensing engine 110) simply detecting the receipt of the signal 118 (e.g., an ACK) in response to communication device 102 receiving the signal 118.

If communication device 102 (e.g., sensing engine 110) determines that signal 118 has not been received (e.g., does not detect receipt of signal 118) , then communication device (e.g., sensing engine 110) may continue to await receipt of a signal 118.

If, on the other hand, communication device 102 (e.g., sensing engine 110) determines that a signal 118 (e.g., an ACK) has been received, then communication device 102 (e.g., sensing engine 110) determines a network state of network 104. The network state represents a congested condition of network 104. The network state may be determined based at least in part on the received signal 118 (e.g., the ACK) . Furthermore, the network state may represent the state of network 104 at the time at which signal 118 is received since the network state is determined from signal 118. This determined network state may be considered the current network state of network 104 (e.g., at time t _c) .

In certain embodiments, determining a network state of network 104 includes determining values for one or more network state variables representing the network state. The network state may be represented using a suitable data structure. In certain embodiments, the network state (e.g., the first network state) is represented as a state vector representing a congested condition of network 104 when a new signal 118 (e.g., a new ACK) is received) . For example, the state vector may be a length-5 state vector, although this disclosure contemplates the state vector having any suitable length. The data structure representing the state (e.g., the state vector) may include any values, such as those described above.

Communication device 102 (e.g., sensing unit 110) may store the determined network state. For example, communication device 102 (e.g., sensing engine 110) may store values representing the determined network state into the third network state variable (s ₂) , which may be the current network state (e.g., at time t _c) .

Using a utility function, communication device 102 (e.g., sensing engine 110) may determine a second utility value. In certain embodiments, the utility function incorporates variables that may be reflected in the network state variable (e.g., values of the vector that represents the network state) such that determining the second utility value using the utility function includes determining the second utility value according to the current network state. As an example, the second utility value, as determined using the utility function, may be a function of throughput (tp) , delay (d) , and packet loss rate (p) . In certain embodiments, the utility function is:

where tp is the throughput, B is a bandwidth of a bottleneck in the network, d is the delay calculated as a difference between a current round-trip time and a minimum RTT, p is the packet loss rate, and δ ₁ and δ ₂ are adjustable coefficients.

Communication device 102 (e.g., sensing engine 110) may store the determined second utility value. For example, communication device 102 (e.g., sensing engine 110) may store the determined second utility value into the second utility variable (u ₂) . This second utility value may correspond to a time period from t ₀ up to the current time (t _c) .

Communication device 102 (e.g., sensing engine 110) may determine whether the difference between a second time (e.g., a current time, t _c) and a first time (e.g., a value of t ₀) is greater than or equal to a round-trip time. For example, communication device 102 (e.g., sensing engine 110) may determine whether the difference between the current time (the value of t _c) and the value of t ₀ is greater than or equal to the EWMA of the round trip time of a packet in network 104. In certain embodiments, the second time is the current time, and the current time (the value of t _c) is the time at which signal 118 was received. On a first pass through this operation, the value of t ₀ is the value to which t ₀ was initialized. On later passes through this operation, the value of t ₀ may be updated, as described below. For example, on a later pass, the value of t ₀ may be the last time for which a reward was computed.

If communication device 102 (e.g., sensing engine 110) determines that the difference between the second time (e.g., the current time) and the first time (e.g., a value of t ₀) is not greater than or equal to the round trip time, then communication device 102 (e.g., actuator engine 114) may update the congestion window according to a current action (the value of a ₁) . For example, communication device 102 (e.g., actuator engine 114) may update a value of the cwnd variable according to the current value of a ₁. In one example, the update to the congestion window may be expressed as cwnd = cwnd + a ₁/cwnd, where the value of a ₁ is the current action. Communication device 102 then may await receipt of another signal 118 (e.g., another ACK) . If communication device 102 continues to receive signals 118 without a round-trip time passing (e.g., with t _c–t ₀ ≥ RTT being true) , then communication device 102 may update the value of the cwnd variable with the current value of the second action variable (a ₁) multiple times prior to updating the value of the cwnd variable based on a newly-determined (e.g., third) action (although the newly-determined third action might be the same as or different than the current value of a ₁) .

If, on the other hand, communication device 102 (e.g., sensing engine 110) determines that the difference between the second time (the current time) and the first time (e.g., the value of t ₀) is greater than or equal to the round trip time, then communication device 102 (e.g., sensing engine 110) may determine, based on the second utility value of the second utility variable (u ₂) and a first utility value of the first utility variable (u ₁) , a reward. The reward may correspond to a first action represented by the value of the first action variable (a ₀) . Thus, in this example, communication device 102 (sending engine 110) determines the reward based at least in part on determining that a difference between the current time and the value of t ₀ is greater than or equal to a round-trip time.

On a first pass through this operation, the value of u ₁ is the value to which u ₁ was initialized. On later passes through this operation, the value of u ₁ may be updated, as described below. In certain embodiments, the first utility value of the first utility variable (u ₁) is associated with a first time period from a first time (atime that is at least a RTT prior to t ₀ in this example) to a second time (t ₀ in this example) , and the second utility value of the second utility variable (u ₂) is associated with a second time period from the second time (t ₀ in this example) to a third time (the current time, t _c, in this example) . The first time period with which the first utility value of the first utility variable (u ₁) is associated may correspond to a first round trip time, and the second time period with which the second utility value of the second utility variable (u ₂) is associated may correspond to a second round trip time. In this example, the difference between the second time and the first time (the first time period) is at least a round-trip time of a first packet 120 in network 104, and the difference between the third time and the second time is at least a round-trip time of a second packet 120 in network 104.

In certain embodiments, such as on a pass other than the initial pass, the first utility value of the first utility variable (u ₁) may be calculated using the utility function (as opposed to simply being initialized to a value of zero) . In certain embodiments, the utility function used to determine the preceding utility (e.g., the first utility value of the first utility variable (u ₁) ) is the same utility function that was used to determine the second network utility (e.g., the second utility value of the second utility variable (u ₂) ) .

In certain embodiments, the reward may be determined according to a difference between the second utility value of the second utility variable (u ₂) and the first (e.g., preceding) utility value of the first utility function (u ₁) , which may be expressed as follows:

Δ _n+1=U _n+1-U _n,

where Δ _n+1 is the difference being determined, U _n+1 is the second utility variable, and U _n is the first (e.g., preceding) utility variable. Furthermore, the particular value for the reward may be determined according to the following conditional:

and

where r _n+1 is the reward being determined.

In certain embodiments, sensing engine 110 may communicate the first network state and the reward to learner engine 112.

Communication device 102 (e.g., learner engine 112) may update a first value function indicating a desirability value associated with values for a first action variable (a ₀) and a first network state variable (s ₀) of communication network 104. For example, in an initial pass, the values for a ₀ and s ₀ may be the values to which variables a ₀ and s ₀ were initialized. As another example, in a later pass, the values for variables a ₀ and s ₀ may represent a first action at a first time (e.g., at the beginning of the first time period described above) and the first network state at the first time. The first value function may be updated according to the reward and a second value function indicating a desirability value associated with values for a second action variable (a ₁) and a second network state variable (s ₁) of communication network 104. In certain embodiments, the value function is a Q-function that is trained using a SARSA temporal difference learning algorithm.

As one particular example, the value function is updated as follows:

where

means y ₁= (1-α) y ₁+αy ₂; s _i, a _i, r _i are values for state, action, and reward variables computed at the beginning of time step i; α _t is the learning rate as a function of time t (e.g., in seconds) ; and γ is the discount factor. The time n-1 may be the first time, n may be the second time; and n+1 may be the third (current) time. Thus, the above value function may be expressed as follows:

This approach evaluates the desirability of an action at the first time based on the desirability of an action taken at a second time and a reward determined at a third time. In other words, in certain embodiments, the desirability of the action taken at the first time is evaluated after two time periods have passed, potentially more completely reflecting the impact of the action taken at the first time on network 104.

Learner engine 112 may communicate the value function to actuator engine 114.

Communication device 102 (e.g., actuator engine 114) may determine, according to the updated value function (and associated desirability value) , an action of a plurality of actions for the value of the third network state variable. The actions may be multiple possible actions that are selectable by actuator engine 114. Each action may include modifying a size of a congestion window (cwnd) by a certain amount (e.g., by a certain value) . In one particular example, the possible actions are cwnd + x, where x is -1, 0, +1, or +3. The determined action may be stored as the value of the third action variable (a ₂) .

The determined action (as represented by the value of the third action variable (a ₂) ) might be the same action as or a different action than the first and second actions (as represented by the values of the first action variable (a ₀) and the second action variable (a ₁) ) , which also might be the same as or different than one another. In other words, the first action might be the same as the second action; the first action might be the same as the third action; the second action might be the same as the third action; or the first action, the second action, and the third action might all be the same action. Additionally or alternatively, some or all of the first, second, and third actions might be different from one another. In certain embodiments, communication device 102 (e.g., actuator engine 114) determines the third action according to an ∈-greedy E2 scheme. This third action may be determined for the current network state. Communication device 102 (e.g., actuator engine 114) may store the determined third action. For example, communication device 102 (e.g., sensing engine 110) may store the determined third action as the third action (a ₂) .

Communication device 102 (e.g., actuator engine 114) may update the reinforcement learning parameters. In certain embodiments, communication device 102 (e.g., actuator engine 114) updates the reinforcement learning parameters such that values assigned to the parameters are advanced one time step. For example, communication device 102 (e.g., learner engine 112) may update the above-discussed reinforcement learning parameters as follows:

● assign the value of the second action variable (a ₁) to the first action variable (a ₀) (e.g., such that the value of the first action variable is replaced with the value of the second action variable) ;

● assign the value of the third action variable (a ₂) to the second action variable (a ₁) (e.g., such that the value of the second action variable is replaced with the value of the third action variable) ;

● assign the value of the second network state variable (s ₁) to the first network state variable (s ₀) (e.g., such that the value of the first network state variable is replaced with the value of the second network variable) ;

● assign the value of the third network state variable (s ₂) to the second network state variable (s ₁) (e.g., such that the value of the second network state variable is replaced with the value of the third network state variable) ;

● assign the value of the second utility variable (u ₂) to the first utility variable (u ₁) (e.g., such that the value of the first utility variable is replaced with the value of the second utility variable) ; and

● assign the value of the current time variable (t _c) to the first time variable (t ₀) (e.g., such that the value of the first time variable is replaced with the value of the current time variable) .

Communication device 102 (e.g., actuator engine 114) may update the congestion window according to the current action (a ₁) (e.g., with the newly assigned value) . For example, communication device 102 (e.g., actuator engine 114) may update a value of the cwnd variable according to the current value of a ₁. In one example, the update to the congestion window may be expressed as cwnd = cwnd + a ₁/cwnd, where the value of second action variable (a ₁) is the current action. Communication device 102 may then await a new signal 118 (e.g., a new ACK) .

FIGURE 3 illustrates an example method 300 for LP-TCP congestion control, according to certain embodiments of this disclosure. The method begins at step 302.

At step 304, communication device 102 (e.g., sensing engine 110) receives a signal 118 from network 104. In certain embodiments, signal 118 is an ACK of a prior packet 120 communicated by communication device 102 via network 104.

At step 306, communication device 102 (e.g., sensing engine 110) determines a first network state. For example, sensing engine 110 may determine a variety of values based on the signal 118 received at step 304 and other variables determined or maintained by communication device 102. The first network state represents a congested condition of network 104. The first network state may be determined based on signal 118 (e.g., the ACK) received at step 304. The network state may be represented using a suitable data structure. In certain embodiments, sensing engine 110 captures the network state as a feature vector (which also may be referred to as a state vector) or other suitable data structure. In certain embodiments, the network state (e.g., the first network state) is represented as a state vector representing a congested condition of network 104 when a new signal 118 (e.g., a new ACK) is received) . For example, the state vector may be a length-5 state vector, although this disclosure contemplates the state vector having any suitable length.

The data structure representing the state (e.g., the state vector) may include values for an exponentially weighted moving average (EWMA) of ACK inter-arrival time, the first signal being an ACK; an EWMA of packet inter-sending time; a ratio of current RTT and a minimum RTT; a slow start threshold; a congestion window (cwnd) size; or a combination of the EWMA of ACK inter-arrival time, the EWMA of packet inter-sending time, the ratio of the current RTT and the minimum RTT, the slow start threshold, and the cwnd size. Although particular example, values are described, this disclosure contemplates the network state of network 104 be represented in any suitable manner using any suitable values.

At step 308, communication device 102 (e.g., learner engine 112) determines a first loss prediction for a first packet 120 based on the first network state determined at step 306. Sensing engine 110 may communicate or otherwise make available to learner engine 112 the first network state determined at step 306. The first loss prediction indicates the likelihood that the first packet 120 will be lost due to network congestion if the first packet 120 is transmitted via network 104. In certain embodiments, the loss prediction is a loss probability determined by learner engine 112. For example, the loss probability may be a percentage value, where a higher percentage indicates a higher likelihood that a packet will be lost if transmitted via network 104. Learner engine 112 may determine the loss prediction based on the trained model, such as the trained random forest model, which may map network states to particular loss predictions (e.g., loss probabilities) .

At step 310, communication device 102 (e.g., actuator engine 114) determines whether to send the first packet 120 based on the loss prediction determined at step 308. Learner engine 112 may communicate or otherwise make available to actuator engine 114 the loss probability determined at step 308. In certain embodiments, actuator engine 114 compares the loss prediction (e.g., the loss probability) for the first packet 120 to a decision threshold to determine whether to send the first packet 120. For example, actuator engine 114 may determine whether the loss probability exceeds the decision threshold.

If communication device 102 (e.g., actuator engine 114) determines at step 310 not to send the first packet 120 (e.g., by determining that the loss probability for the first packet 120 exceeds the decision threshold) , then the first packet 120 is not sent and the method proceeds to step 312. At step 312, communication device 102 (e.g., actuator engine 114) reduces the size of the congestion window by 1.

If communication device 102 (e.g., actuator engine 114) determines at step 310 to send the first packet 120 (e.g., by determining that the loss probability for the first packet 120 does not exceed the decision threshold) , then the method proceeds to step 314. At step 314, communication device 102 (e.g., sending engine 116) sends the first packet 120 via network 104. For example, actuator engine 114 may, in response to the decision at step 310 to send the first packet 120, instruct sending engine 116 to send the first packet 120, and in response to that instruction, sending engine 116 may send the first packet 120 at step 314. At step 316, communication device 102 (e.g., actuator engine 114) increases the size of the congestion window by 1/W. As described above, although increasing the congestion window size by a particular amount (1/W) is primarily described, this disclose contemplates increasing the congestion window size by another amount or leaving the congestion window the same size if actuator engine 114 determines to send packet 120.

At step 318, communication device 102 determines whether there is another packet 120 to send. If communication device 102 determines at step 318 that there is not another packet to send, then the method returns to the beginning to await arrival of another signal 118 (e.g., another ACK) . If, on the other hand, communication device 102 determines at step 318 that there is another packet to send, then the method proceeds to step 320.

At step 320, communication device 102 (e.g., sensing engine 110) determines a second network state. At step 322, communication device 102 (e.g., learner engine 112) determines a second loss prediction for a second packet 120 based on the second network state determined at step 320. The second loss prediction indicates the likelihood that the second packet 120 will be lost due to network congestion if the second packet 120 is transmitted via network 104.

At step 324, communication device 102 (e.g., actuator engine 114) determines whether to send the second packet 120 based on the loss prediction determined at step 322. In certain embodiments, actuator engine 114 compares the loss prediction (e.g., the loss probability) for the second packet 120 to the decision threshold to determine whether to send the second packet 120. For example, actuator engine 114 may determine whether the loss probability exceeds the decision threshold.

If communication device 102 (e.g., actuator engine 114) determines at step 324 not to send the second packet 120 (e.g., by determining that the loss probability for the second packet 120 exceeds the decision threshold) , then the second packet 120 is not sent and the method proceeds to step 326. At step 326, communication device 102 (e.g., actuator engine 114) reduces the size of the congestion window by 1.

If communication device 102 (e.g., actuator engine 114) determines at step 324 to send the second packet 120 (e.g., by determining that the loss probability for the second packet 120 does not exceed the decision threshold) , then the method proceeds to step 328. At step 328, communication device 102 (e.g., sending engine 116) sends the second packet 120 via network 104. The method then returns to the beginning to await arrival of another signal 118 (e.g., another ACK) .

FIGURE 4 illustrates an example method 400 for training learner engine 112 of communication device 102 for LP-TCP congestion control, according to certain embodiments of this disclosure. The method begins at step 402.

At step 404, communication device 102 begins building training data for training learner engine 112. As described above, training data may be collected by running NewReno simulations on NS2.

At step 406, communication device 102 (e.g., sensing engine 110) records a current network state of network 104. For example, prior to sending a packet 120, sensing engine 110 may determine and store a current network state of network 104. In one example, the network state is represented as a feature vector. In certain embodiments, the network state is captured using a feature vector or other suitable data structure, as described above with reference to step 306 in FIGURE 3.

At step 408, communication device 102 (e.g., sending engine 116) transmits a packet 120. At step 410, communication device 102 (e.g., sensing engine 110) determines whether an acknowledgement (ACK) for the communicated packet 120 has been received. For example, sensing engine 110 may be informed of the transmission of packet 120 and determine whether an ACK has been received within a predetermined amount of time. If an ACK for the communicated packet 120 has not been received within the predetermined amount of time, the sensing engine 110 may determine that the communicated packet 120 has been lost.

If communication device 102 (e.g., sensing engine 110) determines at step 410 that an ACK has not been received for the communicated packet 120, then at step 412 communication device 102 (e.g., sensing engine 110 or learner engine 112) may assign a first value to the network state that was recorded at step 406. In other words, if communication device 102 (e.g., sensing engine 110) determines at step 410 that the communicated packet 120 has been lost, then at step 412 communication device 102 (e.g., sensing engine 110 or learner engine 112) assigns a value indicating a lost packet to the network state that was recorded at step 406. In certain embodiments, the first value is 1, although this disclosure contemplates other values being used. The method then proceeds to step 416.

If, on the other hand, communication device 102 (e.g., sensing engine 110) determines at step 410 that an ACK has been received for the communicated packet 120, then at step 414 communication device 102 (e.g., sensing engine 110 or learner engine 112) may assign a second value to the network state that was recorded at step 406. In other words, if communication device 102 (e.g., sensing engine 110) determines at step 410 that the communicated packet 120 was not lost, then at step 414 communication device 102 (e.g., sensing engine 110 or learner engine 112) assigns a value indicating that the communicated packet 120 was not lost to the network state that was recorded at step 406. In certain embodiments, the second value is 0, although this disclosure contemplates other values being used. The method then proceeds to step 416.

At step 416, communication device 102 assigns the network state and associated value to the training data. For example, communication device 102 may store the representation of the network state (e.g., the feature vector for the network state) and the associated assigned value (e.g., 1 for lost packet or 0 for not a lost packet) as part of the training data.

At step 418, communication device 102 determines whether enough losses have been detected. As described above, because lost packet events generally occur less frequently than non-lost packet events, the collection of training data may be terminated when enough losses in the training data have been detected. As an example, communication device 102 may track the number of lost packets that have been detected during collection of the training data and, at step 418, compare the current number of lost packets to a threshold number of lost packets. If the current number of lost packets meets or exceeds the threshold, then communication device 102 may determine that enough losses have been detected. If the current number of lost packets does not meet or exceed the threshold, then communication device 102 may determine that enough losses have not been detected and that collection of training data should continue.

If communication device 102 determines at step 418 that enough losses have not been detected, then the method may return to step 406 to record the network state of network 104 in anticipation of transmitting another packet 120 at step 408. If, on the other hand, communication device 102 determines at step 418 that enough losses have been detected, then the method may proceed to step 420.

At step 420, communication device 102 may use the collected training data to train the classification/regression model being used to train learner engine 112. In one example, the classification/regression model used by communication device 102 is a random forest model.

At step 422, communication device 102 applies the trained model to learner engine 112. Applying the trained model to learner engine 112 may include storing the trained model in a manner that is accessible to learner engine 112, so that learner engine 112 may evaluate future network states reported by sensing engine 110 to determine an appropriate loss prediction (e.g., loss probability) to determine for those future network states. For example, when adequately trained, learner engine 112 may be able to find a matching state (or at least a similar state) in the trained model for future network states reported by sensing engine 110.

At step 424, the method ends.

FIGURE 5 illustrates an example method 500 for RL-TCP congestion control, according to certain embodiments of this disclosure. The method begins at step 502.

At step 504, communication device 102 (e.g., learner engine 112) initializes the reinforcement learning parameters. For example, learner engine 112 may set values of the reinforcement learning parameters to respective initial values or sets of values. As described above, in certain embodiments, the reinforcement learning parameters include one or more action variables (a) , one or more network state variables (s) , one or more utility variables (u) , one or more time variables (t) , and one or more desirability values according to respective value functions. As a more particular example, the reinforcement learning parameters may include the following: action variables a ₀, a ₁, a ₂; network state variables s ₀, s ₁, s ₂; utility variables u ₁, u ₁; time variable (t ₀, t _c) ; and value functions Q (s ₀, a ₀) and Q (s ₁, a ₁) .

As discussed above, each action variable (a) may store one or more values representing an action, each network state variable (s) may store one or more values representing a network state, each utility variable (u) may store one or more values representing a utility, each time variable (t) may store one or more values representing a time, and each value function Q (s, a) may store one or more values representing a desirability value. Although sometimes referenced in the singular, a “value” may include one or more values stored in any suitable format (e.g., as a single value, as an array of values, as a vector, etc. ) . Thus, a variable may store one or more values in any suitable format.

● initialize the first time variable (t ₀) to respective values of zero.

At step 506, communication device 102 (e.g., sensing engine 110) determines whether a signal 118 has been received by communication device 102 from network 104. In certain embodiments, signal 118 is an ACK of a prior packet 120 communicated by communication device 102 via network 104. For example, signal 118 may be an ACK received by communication device 102 in response to a previous packet transmission by communication device 102.

Although step 506 may be an explicit determination, this disclosure contemplates communication device 102 (e.g., sensing engine 110) simply detecting the receipt of signal 118 (e.g., an ACK) in response to communication device 102 receiving signal 118.

If communication device 102 (e.g., sensing engine 110) determines at step 506 that signal 118 has not been received (e.g., does not detect receipt of signal 118) , then method 500 may return to step 506 to await receipt of a signal 118. If communication device 102 (e.g., sensing engine 110) determines at step 506 that signal 118 (e.g., an ACK) has been received, then method 500 may proceed to step 508.

At step 508, communication device 102 (e.g., sensing engine 110) determines a network state of network 104. The network state represents a congested condition of network 104. The network state may be determined based at least in part on the signal 118 (e.g., the ACK) received at step 506. Furthermore, the network state may represent the state of network 104 at the time at which signal 118 is received since the network state is determined from signal 118. The network state determined at step 508 may be considered the current network state of network 104 (e.g., at time t _c) .

In certain embodiments, determining a network state of network 104 includes determining values for one or more network state variables representing the network state. The network state may be represented using a suitable data structure. In certain embodiments, the network state (e.g., the first network state) is represented as a state vector representing a congested condition of network 104 when a new signal 118 (e.g., a new ACK) is received) . For example, the state vector may be a length-5 state vector, although this disclosure contemplates the state vector having any suitable length.

The data structure representing the state (e.g., the state vector) may include values for an EWMA of ACK inter-arrival time, the first signal being an ACK; an EWMA of packet inter-sending time; a ratio of current RTT and a minimum RTT; a slow start threshold; a congestion window (cwnd) size; or a combination of the EWMA of ACK inter-arrival time, the EWMA of packet inter-sending time, the ratio of the current RTT and the minimum RTT, the slow start threshold, and the cwnd size. Although particular example, values are described, this disclosure contemplates the network state of network 104 be represented in any suitable manner using any suitable values.

At step 510, communication device 102 (e.g., sensing engine 110) stores the determined network state. For example, communication device 102 (e.g., sensing engine 110) may store values representing the determined network state into the third network state variable (s ₂) , which may be the current network state (e.g., at time t _c) .

At step 512, using a utility function, communication device 102 (e.g., sensing engine 110) determines a second utility value. In certain embodiments, the utility function incorporates variables that may be reflected in the network state variable (e.g., values of the vector that represents the network state) such that determining the second utility value using the utility function includes determining the second utility value according to the current network state. As an example, the second utility value, as determined using the utility function, may be a function of throughput (tp) , delay (d) , and packet loss rate (p) . In certain embodiments, the utility function is:

At step 514, communication device 102 (e.g., sensing engine 110) stores the determined second utility value. For example, communication device 102 (e.g., sensing engine 110) may store the determined second utility value into the second utility variable (u ₂) . This second utility value may correspond to a time period from t ₀ up to the current time (t _c) .

At step 516, communication device 102 (e.g., sensing engine 110) determines whether the difference between a second time (e.g., a current time, t _c) and a first time (e.g., a value of t ₀) is greater than or equal to a round-trip time. For example, communication device 102 (e.g., sensing engine 110) may determine whether the difference between the current time (the value of t _c) and the value of t ₀ is greater than or equal to the EWMA of the round trip time of a packet in network 104. In certain embodiments, the second time is the current time, and the current time (the value of t _c) is the time at which signal 118 was received at step 506. On a first pass through method 500, the value of t ₀ is the value to which t ₀ was initialized (e.g., at step 504) . On later passes through method 500, the value of t ₀ may be updated, as described below. For example, on a later pass, the value of t ₀ may be the last time for which a reward was computed.

If communication device 102 (e.g., sensing engine 110) determines at step 516 that the difference between the second time (e.g., current time) and the first time (e.g., a value of t ₀) is not greater than or equal to the round trip time, then the method may proceed to step 518. At step 518, communication device 102 (e.g., actuator engine 114) may update the congestion window according to a current action (the value of a ₁) . For example, communication device 102 (e.g., actuator engine 114) may update a value of the cwnd variable according to the current value of a ₁. In one example, the update to the congestion window at step 518 may be expressed as cwnd = cwnd + a ₁/cwnd, where the value of a ₁ is the current action. The method may then return to step 506 to await receipt of another signal 118 (e.g., another ACK) . If communication device 102 continues to receive signals 118 without a round-trip time passing (e.g., with t _c–t ₀ ≥ RTT being true) , then communication device 102 may update the value of the cwnd variable with the current value of the second action variable (a ₁) multiple times prior to updating the value of the cwnd variable based on a newly-determined (e.g., third) action (although the newly-determined third action might be the same as or different than the current value of a ₁) .

Returning to step 516, if communication device 102 (e.g., sensing engine 110) determines at step 516 that the difference between the second time (e.g., current time) and the first time (e.g., the value of t ₀) is greater than or equal to the round trip time, then the method may proceed to step 520.

At step 520, communication device 102 (e.g., sensing engine 110) determines, based on the second utility value of the second utility variable (u ₂) and a first utility value of the first utility variable (u ₁) , a reward. The reward may correspond to a first action represented by the value of the first action variable (a ₀) . Thus, in this example, communication device 102 (sending engine 110) determines the reward based at least in part on determining (at step 516) that a difference between the current time and the value of t ₀ is greater than or equal to a round-trip time.

On a first pass through method 500, the value of u ₁ is the value to which u ₁ was initialized (e.g., at step 504) . On later passes through method 500, the value of u ₁ may be updated, as described below. In certain embodiments, the first utility value of the first utility variable (u ₁) is associated with a first time period from a first time (atime that is at least a RTT prior to t ₀ in this example) to a second time (t ₀ in this example) , and the second utility value of the second utility variable (u ₂) is associated with a second time period from the second time (t ₀ in this example) to a third time (the current time, t _c, in this example) . The first time period with which the first utility value of the first utility variable (u ₁) is associated may correspond to a first round trip time, and the second time period with which the second utility value of the second utility variable (u ₂) is associated may correspond to a second round trip time. In this example, the difference between the second time and the first time (the first time period) is at least a round-trip time of a first packet 120 in network 104, and the difference between the third time and the second time is at least a round-trip time of a second packet 120 in network 104.

In certain embodiments, such as on a pass other than the initial pass through method 500, the first utility value of the first utility variable (u ₁) may be calculated using the utility function (as opposed to simply being initialized to a value of zero) . In certain embodiments, the utility function used to determine the preceding utility (e.g., the first utility value of the first utility variable (u ₁) ) is the same utility function that was used to determine the second network utility (e.g., the second utility value of the second utility variable (u ₂) ) at step 512.

Δ _n+1=U _n+1-U _n,

where difference being determined, U _n+1 is the second utility variable, and U _n is the first (e.g., preceding) utility variable. Furthermore, the particular value for the reward may be determined according to the following conditional:

where r _n+1 is the reward being determined.

At step 522, communication device 102 (e.g., learner engine 112) may update a first value function indicating a desirability value associated with values for a first action variable (a ₀) and a first network state variable (s ₀) of communication network 104. For example, in an initial pass, the values for variables a ₀ and s ₀ may be the values to which variables a ₀ and s ₀ were initialized at step 504. As another example, in a later pass, the values for variables a ₀ and s ₀ may be for a first action at a first time (e.g., at the beginning of the first time period described above with reference to step 520) and the first network state at the first time. The first value function may be updated according to the reward (determined at step 520) and a second value function indicating a desirability value associated with values for a second action variable (a ₁) and a second network state variable (s ₁) of communication network 104. In certain embodiments, the value function is a Q-function that is trained using a SARSA temporal difference learning algorithm.

As one particular example, the value function is updated as follows:

where

means y ₁= (1-α) y ₁+αy ₂; s _i, a _i, r _i are values for state, action, and reward variables computed at the beginning of time step i; α _t is the learning rate as a function of time t (e.g., in seconds) ; and γ is the discount factor. The time n-1 may be the first time, n may be the second time; and n+1 may be the third (e.g., current) time. Thus, the above value function may be expressed as follows:

Learner engine 112 may communicate the value function to actuator engine 114.

At step 524, communication device 102 (e.g., actuator engine 114) may determine, according to the value function updated at step 522 (and the associated desirability value) , an action of the plurality of actions for the value of the third network state variable. The determined action may be stored as the value of the third action variable (a ₂) . The action (as represented by the value of the third action variable (a ₂) ) might be the same action as or a different action than the first and second actions (as represented by the values of the first action variable (a ₀) and the second action variable (a ₁) ) , which also might be the same as or different than one another. In other words, the first action might be the same as the second action; the first action might be the same as the third action; the second action might be the same as the third action; or the first action, the second action, and the third action might all be the same action. Additionally or alternatively, some or all of the first, second, and third actions might be different from one another. In certain embodiments, communication device 102 (e.g., actuator engine 114) determines the third action according to an ∈-greedy E2 scheme. This third action may be determined for the current network state (e.g., the network state determined at step 508) .

At step 526, communication device 102 (e.g., actuator engine 114) stores the determined third action. For example, communication device 102 (e.g., sensing engine 110) may store the determined third action as the third action (a ₂) .

At step 528, communication device 102 (e.g., actuator engine 114) updates the reinforcement learning parameters. In certain embodiments, communication device 102 (e.g., actuator engine 114) updates the reinforcement learning parameters such that values assigned to the parameters are advanced one time step. For example, communication device 102 (e.g., learner engine 112) may update the above-discussed reinforcement learning parameters as follows:

At step 530, communication device 102 determines whether to terminate method 500. For example, communication device 102 may determine whether to terminate communication of packets 120, receipt of signals 118, congestion control, RL-TCP congestion control, or some combination of these. If communication device 102 determines at step 530 not to terminate method 500, at step 518, communication device 102 (e.g., actuator engine 114) may update the congestion window according to the current action (a ₁) (e.g., with the newly assigned value from step 528) . For example, communication device 102 (e.g., actuator engine 114) may update a value of the cwnd variable according to the current value of a ₁. In one example, the update to the congestion window at step 518 may be expressed as cwnd = cwnd + a ₁/cwnd, where the value of the second action variable (a ₁) is the current action. The method may then return to step 506 to await receipt of another signal 118 (e.g., another ACK) . Method 500 may then return to step 506 to await a new signal 118 (e.g., a new ACK) .

If, on the other hand, communication device 102 determines at step 530 to terminate method 500, then at step 532 the method may end.

Thus, an example of method 500 will now be described in which at least two iterations through method 500 have occurred. The first iteration may correspond to a first (though not necessarily the initial) time period, and the second iteration may correspond to a second time period that is subsequent to the first time period. The first time period may be from a first time to a second time, and the second time period may be from the second time to a third time. For purposes of this example, it will be assumed that the current time is the third time.

Continuing with this example, method 500 may include determining, by communication device 102 according to a third network state (s ₂) (e.g., a network state determined at the end of a second time window (e.g., at a third time) ) , a second utility value (u ₂) using the utility function. The second utility value corresponds to the second time period, which is from the second time to a third time (e.g., from t ₀ to t _c during a second iteration through method 500, with updated values for t ₀ and t _c) . The third network state is determined based on a third signal 118 received from network 104 at the third time.

Communication device 102 may have determined, at the second time (e.g., the end of the first time period) , according to a second network state (s ₁) (e.g., a network state determined at the end of the first time period (e.g., at a second time) ) , a first utility value (u ₁) using the utility function. The first utility value corresponds to the first time period, which is from a first time to a second time (e.g., from t ₀ to t _c during a first iteration through method 500) . The second network state is determined based on a second signal 118 received from network 104 at the second time. In a previous iteration, communication device 102 may have determined a first network state based on a first signal 118 received from network 104 at the first time.

Continuing with this example, method 500 may include determining, by communication device 102 and based on the first utility value (u ₁) and the second utility value (u ₂) , a reward. This reward may correspond to an action implemented by communication device 102 at the first time (e.g., the beginning of the first time period) .

Continuing with this example, method 500 may include updating, by communication device 102, a first value function (Q (s ₀, a ₀) ) indicating a first desirability value associated with the first network state (s ₀) and a first action (a ₀) . The first value function (Q (s ₀, a ₀) ) may be updated according to the reward and a second value function (Q (s ₁, a ₁) ) indicating a second desirability value associated with the second network state (s ₁) and a second action (a ₁) .

Continuing with this example, method 500 may include determining, by communication device 102 and according to the first value function (Q (s ₀, a ₀) ) , a third action (a ₂) . The first action (a ₀) , the second action (a ₁) , and the third action (a ₂) are actions of a plurality of actions, and each action of the plurality of actions includes modifying a value of a congestion window (cwnd) variable by a respective amount. The value of the cwnd variable represents a size of the cwnd.

Continuing with this example, method 500 may include updating the cwnd variable based on the third action (e.g., using a ₁ at step 518 as updated with a ₂ at step 526) .

As will be described in connection with the following figures, the performance of the disclosed learning-based TCP congestion control protocols can be compared with that of TCP NewReno and Q-learning TCP in NS2. It should be understood that the values, measurements, and other results and conclusions illustrated and described below are for example purposes only and only reflect certain embodiments of this disclosure.

FIGURE 6 illustrates an example dumbbell topology 600, according to certain embodiments of this disclosure. Topology 600 includes K senders 602, K receivers 604, and

routers

606a and 606b. In certain embodiments, senders 602 and receivers 604 are network devices that are similar to network device 102 in FIGURE 1; however, this disclosure contemplates senders 602 and receivers 604 being any suitable types of network devices that are capable of communicating electronically with one another, using TCP for example.

Senders 602 are connected to router 606a via links 608a, and receivers are connected to router 606b via links 608b, which may be part of network 104 for example.

Routers

606a and 606b are coupled via a link 610. Link 610 can be considered a bottleneck link of bandwidth B Mb/sand one-way trip time D milliseconds (ms) .

The buffer size at router 606a is L, which represents a number of packets. A network setup with K senders/receivers and buffer size L is denoted by (L, K) . The minimum round-trip time between each sender 602 and receiver 604 is RTT _min. By default, for purposes of this example, it will be assumed that bottleneck link 610 has the following characteristics: B = 10 Mb/s, D = 74 ms, and RTT _min = 150ms. Thus, the bandwidth delay product (BPD) is 150 packets. As shown in FIGURE 4, links 608 have the following characteristics: B = 1 Gb/sand D = 1 ms. The bandwidth and delay for links 608 being greater than the bandwidth and delay for link 610 provides one example of how link 610 can create a bottleneck for transmissions from senders 602 to receivers 604. Additionally, for purposes of this example, the default traffic type for each sender 602 is “always-on. ”

In the following description, performance of TCP congestion control schemes is evaluated in two main scenarios:

● Single sender with varying buffer size L = 5, 50, and 150; and

● Four senders sharing bottleneck link 610 with buffer size L = 50.

The metrics used include throughput (tp) , delay (d) , packet loss rate (p) , and utility (U) . The throughput of a flow may be measured by dividing the total amount of bytes received by the time during which the sender was active) . Average delay of a flow can be computed as d = RTT –RTT _min. Packet loss rate can be computed as follows:

Utility can be computed as U = log (tp) -log (RTT) . E (·) represents expectation, and V (·) represents variance. Unless otherwise stated, the results are averaged over 100 iterations, each of which lasts approximately 400 seconds. The unit of measure for throughput is Mb/s, and the unit of measure for delay is ms. E (·) represents expectation, and V (·) represents variance.

Table 3 below lists the TCP congestion control schemes with which the LP-TCP and RL-TCP congestion control schemes according to embodiments of this disclosure are compared. For the RL-TCP congestion control scheme, α _t is set to be a function of time t (in seconds) as α _t = 0.3 *0.995 ^[t/10] , and the discount factor γ is set to 0.5. For purposes of this example, the variable α _t decays to zero as t goes to infinity.

Table 3: TCP CC schemes for comparisons.

For the E2 scheme, an ε-greedy exploration is used, which means that the agent (e.g., communication device 102) takes a random action at a probability ε, where ε = 0.1, otherwise the best action at that moment. In this example, the actions that the agent (e.g., communication device 102) takes are adjusting the congestion window by one of the following values {-1, 0, 1, +3} . The five states are each discretized into 10 intervals along the ranges in which the states frequently lie.

Single Sender

This section describes the evaluation of the performance of the LP-TCP and RL-TCP congestion control mechanisms when there is only a single sender with “always-on” traffic in dumbbell network 600. First, the impact of credit assignment in RL-based TCP congestion control schemes is described.

Effect of credit assignment in RL-based TCP congestion control schemes: In Table 4, below, the computed throughput and delay of a single “always-on” sender using various TCP congestion control schemes over 10 iterations is provided. For RL-TCP, δ1 = 0.01, δ2 = 0.1. The proposed credit assignment, throughput, and/or delay are improved, and a better M _e is achieved.

	E (tp)	V (tp)	E (d)	V (d)	M _e
Q-TCP	6.176	0.2674	16.26	4.6624	1.541
Q-TCP _ca	9.597	8.7210e-3	20.31	3.6909	1.960
Q _a-TCP	9.658	0.0191	14.8	2.818	1.998
Q _a-TCP _ca	9.857	8.100e-5	3.74	3.240e-2	2.156
RL-TCP _no-ca	9.723	9.3010e-3	13.87	3.1521	2.011
RL-TCP	9.869	7.490e-4	3.86	3.240e-2	2.154

Table 4: Effect of credit assignment for Q-TCP, Q _a-TCP and RL-TCP. Buffer size is 50 packets.

FIGURE 7 illustrates the changes of congestion window size for Q-TCP, Q-TCPca, RL-TCPno-ca, and RL-TCP congestion control schemes during one simulation, according to certain embodiments of this disclosure. With an action space tailored to under-buffered networks and the proposed credit assignment, the RL-TCP congestion control scheme learns quickly, while the Q-TCP congestion control scheme encounters frequent time-outs due to large congestion window size modifications in its original action space, which may not be suitable for under-buffered situations. With the proposed credit assignment, the Q-TCPca congestion control scheme learns and rescues from the consequences of unfavorable actions at some states, but still encounters time-out from time to time.

Varying the buffer size (L) may have an impact. Table 5 below shows example results of various TCP congestion control schemes when the buffer size L at router 406a is varied.

	L	E (tp)	V (tp)	E (d)	V (d)	p	M _e
NewReno	5	7.490	0	1.900	0	1.4e-4	1.949
Q-TCP	5	0.652	1.32e-2	3.101	1.14e-2	4.8e-2	-0.541
Q _a-TCP	5	5.262	1.08e-1	2.504	6.38e-3	1.05e-3	1.568
LP-TCP	5	9.675	2.5e-5	4.5	9.0e-2	2.319e-5	2.119
RL-TCP	5	7.666	8.03e-2	2.514	5.80e-3	2.9e-4	1, 944
NewReno	50	8.910	0	15.98	1.60e-3	2.6e-4	1.910
Q-TCP	50	6.198	3.38e-1	17.24	5.750	3.11e-3	1.539
Q _a-TCP	50	9.503	3.68e-2	16.10	4.113	2.1e-4	1.973
LP-TCP	50	8.91	0	16.54	3.04e-2	2.645e-4	1.906
RL-TCP	50	9.762	9.96e-4	3.885	4.807e-2	1.8e-4	2.142
NewReno	150	9.920	0	82.50	0	7.8e-4	1.853
Q-TCP	150	8.873	8.62e-2	71.91	40.53	1.46e-3	1.755
Q _a-TCP	150	9.909	4.74e-4	29.58	91.52	7.6e-4	1.954
LP-TCP	150	9.920	0	82.78	7.6e-3	7.867e-4	1.852
RL-TCP	150	9.919	4.75e-6	4.041	4.92e-2	7.5e-4	2.154

Table 5: Average performance of a single TCP sender using NewReno, Q-TCP, Q _a-TCP, LP-TCP, or RL-TCP, with varying buffer size, L.

As shown in Table 5, in some scenarios the LP-TCP congestion control scheme benefits most from a small buffer size, L: when compared with NewReno at L = 5, the LP-TCP congestion control scheme shows a 29%increase on throughput with RTT close to RTT _min. In this example, when buffer size, L, increases, the LP-TCP congestion control scheme selects a large th (see Table 1 above) for a better M _e, and performs similarly to NewReno. In this example, when L = 50, the RL-TCP congestion control scheme achieves 7-8%decrease in RTT and at least 9%increase in throughput, compared to NewReno and Q-TCP. Table 5 suggests that the LP-TCP congestion control scheme can fully use a 10Mb/sbottleneck link with a buffer size as small as five packets. Table 5 also suggest that the RL-TCP congestion control scheme may use a larger buffer size (e.g., L = 50) to fully use a 10MB/s bottlenetck link, but the RL-TCP congestion control scheme also fully utilizes the bandwidth while keeping RTT close to RTT _min. However, NewReno needs buffer size of 150 packets, while having delay as high as 82ms.

Multiple Senders

This section considers scenarios in which there are multiple senders 602 in dumbbell network 600, including “always-on” senders and “on-off” senders.

In a first scenario, four senders 602 are considered to be “always-on” senders. Table 6 shows the performance per sender when all senders use the same TCP congestion control scheme. For the RL-TCP congestion control scheme, δ1 = 0.2, δ2 = 0.1. In this example, compared to NewReno, the LP-TCP congestion control scheme improves throughput by 5%with slightly increased delay, and the RL-TCP congestion control scheme increases throughput by 5%with a similar delay. In this example, the RL-TCP congestion control scheme provides the best throughput and delay tradeoff in terms of M _e, albeit having a slightly higher variance of its performance (again, in this example) .

	E (tp)	V (tp)	E (d)	V (d)	p	M _e
NewReno	2.296	3.58e-4	17.34	9.673e-3	1.43e-3	0.545
Q-TCP	1.836	2.96e-2	20.37	8.094e-1	1.888e-2	0.306
Q _a-TCP	2.290	4.32e-2	20.30	7.483e-1	2.33e-3	0.527
LP-TCP	2.412	2.931e-2	24.04	2.189e-1	1.33e-3	0.562
RL-TCP	2.409	1.17e-1	17.67	8.183	8.2e-4	0.592

Table 6: Average performance per sender when four “always-on” senders that adopt the same CC scheme coexist in the bottleneck link with L=50.

In a second scenario, four senders 602 are considered to be “on-off” senders. In this example, when a sender is on, the sender sends a number of bytes according to an exponential distribution with a mean of 10MB. Then, the sender turns off for a time period according to an exponential distribution with a mean of 0.5s. Table 7 below shows the average performance per sender when all senders adopt the same TCP congestion control scheme. For this example, the buffer size (L) is 50 packets. For the RL-TCP congestion control scheme, δ1 = 0.4, δ2 = 0.1.

	E (tp)	V (tp)	E (d)	V (d)	p	M _e
NewReno	2.305	6.18e-3	16.42	1.003e-1	1.43e-3	0.555
Q-TCP	1.837	2.722e-2	19.91	6.668e-1	1.813e-2	0.309
Q _a-TCP	2.296	2.767e-2	19.20	5.945e-1	2.194e-3	0.535
LP-TCP	2.394	3.126e-2	21.56	3.677e-1	1.343e-3	0.565
RL-TCP	2.399	4.234e-2	20.21	5.871	8.277e-4	0.574

Table 7: Average performance per sender when four “on-off” senders that adopt the same CC scheme coexist in the bottleneck link with L=50.

As shown in Table 7, the Q-TCP congestion control scheme does not perform well in this dynamic situation. The Q _a-TCP congestion control scheme in this example, however, arguably performs better than the Q-TCP congestion control scheme, with a throughput comparable to that of the NewReno congestion control scheme, and a slightly increased delay. The LP-TCP congestion control scheme achieves a higher throughput and lower packet loss rate than the NewReno congestion control scheme, also at the expense of slightly increased delay in this particular example. In this example, the RL-TCP congestion control scheme achieves the highest M _e, with a slightly larger variance in its performance.

Two learning-based TCP congestion control protocols for a wired network, the LP-TCP and RL-TCP congestion control schemes, have been disclosed. The performance of the disclosed TCP congestion control schemes was compared with that of the NewReno congestion control scheme in NS2. With proper training, the RL-TCP congestion control scheme shows superior performance compared to the TCP NewReno congestion control scheme, while the LP-TCP congestion control scheme outperforms NewReno on average throughput and packet loss rate. In certain embodiments, the LP-TCP and RL-TCP congestion control schemes improve network bottleneck bandwidth usage and thus communication efficiency, which might indicate more income for network providers.

FIGURE 8 is a block diagram of an embodiment processing system 900 for performing methods described herein, which may be installed in a host device, according to certain embodiments of this disclosure. As shown, the processing system 900 includes a processor 904, a memory 906, and interfaces 910-914, which may (or may not) be arranged as shown in the figure. The processor 904 may be any component or collection of components adapted to perform computations and/or other processing related tasks, and the memory 906 may be any component or collection of components adapted to store programming and/or instructions for execution by the processor 904. In an embodiment, the memory 906 includes a non-transitory computer readable medium. The

interfaces

910, 912, 914 may be any component or collection of components that allow the processing system 900 to communicate with other devices/components and/or a user. For example, one or more of the

interfaces

910, 912, 914 may be adapted to communicate data, control, or management messages from the processor 904 to applications installed on the host device and/or a remote device. As another example, one or more of the

interfaces

910, 912, 914 may be adapted to allow a user or user device (e.g., personal computer (PC) , etc. ) to interact/communicate with the processing system 900. The processing system 900 may include additional components not depicted in the figure, such as long term storage (e.g., non-volatile memory, etc. ) .

In some embodiments, the processing system 900 is included in a network device that is accessing, or part otherwise of, a telecommunication network. In one example, the processing system 900 is in a network-side device in a wireless or wireline telecommunication network, such as a base station, a relay station, a scheduler, a controller, a gateway, a router, an applications server, or any other device in the telecommunication network. In other embodiments, the processing system 900 is in a user-side device accessing a wireless or wireline telecommunication network, such as a mobile station, a user equipment (UE) , a personal computer (PC) , a tablet, a wearable communication device (e.g., a smartwatch, etc. ) , or any other device adapted to access a telecommunication network.

In some embodiments, one or more of the

interfaces

910, 912, 914 connects the processing system 900 to a transceiver adapted to transmit and receive signaling over the telecommunication network.

FIGURE 9 is a block diagram of a transceiver 1000 adapted to transmit and receive signaling over a telecommunication network, according to certain embodiments of this disclosure. The transceiver 1000 may be installed in a host device. As shown, the transceiver 1000 comprises a network-side interface 1002, a coupler 1004, a transmitter 1006, a receiver 1008, a signal processor 1010, and a device-side interface 1012. The network-side interface 1002 may include any component or collection of components adapted to transmit or receive signaling over a wireless or wireline telecommunication network. The coupler 1004 may include any component or collection of components adapted to facilitate bi-directional communication over the network-side interface 1002. The transmitter 1006 may include any component or collection of components (e.g., up-converter, power amplifier, etc. ) adapted to convert a baseband signal into a modulated carrier signal suitable for transmission over the network-side interface 1002. The receiver 1008 may include any component or collection of components (e.g., down-converter, low noise amplifier, etc. ) adapted to convert a carrier signal received over the network-side interface 1002 into a baseband signal. The signal processor 2410 may include any component or collection of components adapted to convert a baseband signal into a data signal suitable for communication over the device-side interface (s) 1012, or vice-versa. The device-side interface (s) 1012 may include any component or collection of components adapted to communicate data-signals between the signal processor 1010 and components within the host device (e.g., the processing system 900, local area network (LAN) ports, etc. ) .

The transceiver 1000 may transmit and receive signaling over any type of communications medium. In some embodiments, the transceiver 1000 transmits and receives signaling over a wireless medium. For example, the transceiver 1000 may be a wireless transceiver adapted to communicate in accordance with a wireless telecommunications protocol, such as a cellular protocol (e.g., long-term evolution (LTE) , etc. ) , a wireless local area network (WLAN) protocol (e.g., Wi-Fi, etc. ) , or any other type of wireless protocol (e.g., Bluetooth, near field communication (NFC) , etc. ) . In such embodiments, the network-side interface 1002 comprises one or more antenna/radiating elements. For example, the network-side interface 1002 may include a single antenna, multiple separate antennas, or a multi-antenna array configured for multi-layer communication, e.g., single input multiple output (SIMO) , multiple input single output (MISO) , multiple input multiple output (MIMO) , etc. In other embodiments, the transceiver 1000 transmits and receives signaling over a wireline medium, e.g., twisted-pair cable, coaxial cable, optical fiber, etc. Specific processing systems and/or transceivers may utilize all of the components shown, or only a subset of the components, and levels of integration may vary from device to device.

It should be appreciated that one or more steps of the embodiment methods provided herein may be performed by corresponding units or modules. For example, a signal may be transmitted by a transmitting unit or a transmitting module. A signal may be received by a receiving unit or a receiving module. A signal may be processed by a processing unit or a processing module. The respective units/modules may be hardware, software, or a combination thereof. For instance, one or more of the units/modules may be an integrated circuit, such as field programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) .

Although the various systems and components described throughout this disclosure are described and illustrated as including particular components arranged in a particular manner, this disclosure contemplates these systems and components including additional components, fewer components, different components, or a combination of these. Additionally, each of the systems and components described throughout this disclosure may be implemented using any suitable combination of hardware, firmware, and software.

Although this disclosure describes particular components as performing particular operations, this disclosure contemplates other components performing those operations. Additionally, although this disclosure describes or illustrates particular operations as occurring in a particular order, this disclosure contemplates any suitable operations occurring in any suitable order. Moreover, this disclosure contemplates any suitable operations being repeated one or more times in any suitable order. Although this disclosure describes or illustrates particular operations as occurring in sequence, this disclosure contemplates any suitable operations occurring at substantially the same time, where appropriate. Any suitable operation or sequence of operations described or illustrated herein may be interrupted, suspended, or otherwise controlled by another process, such as an operating system or kernel, where appropriate. The acts can operate in an operating system environment or as stand-alone routines occupying all or a substantial part of the system processing.

It may be advantageous to set forth definitions of certain words and phrases used throughout this patent document. The terms “include” and “comprise, ” as well as derivatives thereof, mean inclusion without limitation. The term “or” is inclusive, meaning and/or. The phrases “associated with” and “associated therewith, ” as well as derivatives thereof, mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, or the like.

While this invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications and combinations of the illustrative embodiments, as well as other embodiments of the invention, will be apparent to persons skilled in the art upon reference to the description. It is therefore intended that the appended claims encompass any such modifications or embodiments.

Claims

A method for Transmission Control Protocol (TCP) congestion control, the method comprising:

determining, by a communication device and based on a first signal received from a communication network, a first network state;

determining, by a loss predictor of the communication device and according to the first network state, a first loss prediction for a first packet, the first loss prediction indicating a likelihood that the first packet will be lost due to network congestion if the first packet is transmitted via the communication network; and

determining, by the communication device and based at least on the first loss prediction, to transmit the first packet via the communication network.
The method of Claim 1, further comprising determining, based on the first loss prediction, whether to increase or decrease a value of a congestion window variable, the value of the congestion window variable indicating a size of a congestion window.
The method as in any of Claims 1-2, wherein the first loss prediction comprises a first loss probability indicating an estimate of a probability that the first packet will be lost if the first packet is transmitted via the communication network.
The method of Claim 3, wherein determining, by the communication device and based at least on the first loss prediction, to transmit the first packet via the communication network comprises:

determining that the first loss probability does not exceed a loss probability threshold; and

determining, based at least on determining that the first loss probability does not exceed the loss probability threshold, to transmit the first packet via the communication network.
The method as in any of Claims 1-4, further comprising transmitting, in response to determining to transmit the first packet via the communication network, the first packet via the communication network.
The method as in any of Claims 1-5, further comprising:

determining, by the communication device and based on a second signal received from the communication network, a second network state;

determining, by the communication device and according to the second network state, a second loss prediction for a second packet, the second loss prediction indicating a likelihood that the second packet will be lost if the second packet is transmitted via the communication network; and

determining, by the communication device and based at least on the second loss prediction, not to transmit the second packet via the communication network.
The method of Claim 6, wherein:

the second loss prediction comprises a second loss probability indicating an estimate of a probability that the second packet will be lost if the second packet is transmitted via the communication network; and

determining, by the communication device and based at least on the second loss prediction, not to transmit the second packet via the communication network comprises:

determining that the second loss probability exceeds a loss probability threshold; and

determining, based at least on determining that the second loss probability exceeds the loss probability threshold, not to transmit the second packet via the communication network.
The method of Claim 7, further comprising decreasing, based at least on determining that the second loss probability exceeds the loss probability threshold, a value of a congestion window variable, the value of the congestion window variable indicating a size of a congestion window.
The method as in any of Claims 1-8, further comprising performing supervised training of the loss predictor.
The method of Claim 9, wherein performing supervised training of the loss predictor comprises:

collecting, through simulations performed on a network simulator, training data for training the loss predictor;

training a model according to the training data;

correlating network states to respective probabilities of loss according to the trained model.
The method as in any of Claims 1-10, wherein the first signal is an acknowledgement (ACK) received by the communication device in response to a previous packet transmission by the communication device.
The method as in any of Claims 1-11, wherein the first network state is represented as a state vector representing the congested condition of the network.
The method of Claim 12, wherein the state vector comprises values for:

an exponentially weighted moving average (EWMA) of acknowledgement (ACK) inter-arrival time, the first signal being an ACK;

an EWMA of packet inter-sending time;

a ratio of current round-trip time (RTT) and a minimum RTT;

a slow start threshold;

a congestion window size; or

a combination of the EWMA of ACK inter-arrival time, the EWMA of packet inter-sending time, the ratio of the current RTT and the minimum RTT, the slow start threshold, and the congestion window size.
A system for Transmission Control Protocol (TCP) congestion control, the system comprising:

a non-transitory memory storage comprising instructions; and

one or more processors in communication with the memory storage, wherein the one or more processors are configured to execute the instructions to:

determine, based on a first signal received from a communication network, a first network state;

determine, according to the first network state, a first loss prediction for a first packet, the first loss prediction indicating a likelihood that the first packet will be lost due to network congestion if the first packet is transmitted via the communication network; and

determine, based at least on the first loss prediction, to transmit the first packet via the communication network.
The system of Claim 14, wherein the one or more processors are further configured to execute the instructions to determine, based on the first loss prediction, whether to increase or decrease a value of a congestion window variable, the value of the congestion window variable indicating a size of a congestion window.
The system as in any of Claims 14-15, wherein the first loss prediction comprises a first loss probability indicating an estimate of a probability that the first packet will be lost if the first packet is transmitted via the communication network.
The system of Claim 16, wherein determining, by the communication device and based at least on the first loss prediction, to transmit the first packet via the communication network comprises:

determining that the first loss probability does not exceed a loss probability threshold; and

determining, based at least on determining that the first loss probability does not exceed the loss probability threshold, to transmit the first packet via the communication network.
The system as in any of Claims 14-17, wherein the one or more processors are further configured to execute the instructions to transmit, in response to determining to transmit the first packet via the communication network, the first packet via the communication network.
The system as in any of Claims 14-18, wherein the one or more processors are further configured to execute the instructions to:

determine, by the communication device and based on a second signal received from the communication network, a second network state;

determine, by the communication device and according to the second network state, a second loss prediction for a second packet, the second loss prediction indicating a likelihood that the second packet will be lost if the second packet is transmitted via the communication network; and

determine, by the communication device and based at least on the second loss prediction, not to transmit the second packet via the communication network.
The system of Claim 19, wherein:

the second loss prediction comprises a second loss probability indicating an estimate of a probability that the second packet will be lost if the second packet is transmitted via the communication network; and

determining, by the communication device and based at least on the second loss prediction, not to transmit the second packet via the communication network comprises:

determining that the second loss probability exceeds a loss probability threshold; and

determining, based at least on determining that the second loss probability exceeds the loss probability threshold, not to transmit the second packet via the communication network.
The system of Claim 20, wherein the one or more processors are further configured to execute the instructions to decrease, based at least on determining that the second loss probability exceeds the loss probability threshold, a value of a congestion window variable, the value of the congestion window variable indicating a size of a congestion window.
The system as in any of Claims 14-21, wherein the one or more processors are further configured to execute the instructions to perform supervised training of the loss predictor.
The system of Claim 22, wherein performing supervised training of the loss predictor comprises:

collecting, through simulations performed on a network simulator, training data for training the loss predictor;

training a model according to the training data;

correlating network states to respective probabilities of loss according to the trained model.
The system as in any of Claims 14-23, wherein the first signal is an acknowledgement (ACK) received by the communication device in response to a previous packet transmission by the communication device.
The system as in any of Claims 14-24, wherein the first network state is represented as a state vector representing the congested condition of the network.
The system of Claim 25, wherein the state vector comprises values for:

an exponentially weighted moving average (EWMA) of acknowledgement (ACK) inter-arrival time, the first signal being an ACK;

an EWMA of packet inter-sending time;

a ratio of current round-trip time (RTT) and a minimum RTT;

a slow start threshold;

a congestion window size; or

a combination of the EWMA of ACK inter-arrival time, the EWMA of packet inter-sending time, the ratio of the current RTT and the minimum RTT, the slow start threshold, and the congestion window size.
A non-transitory computer-readable media storing computer instructions for Transmission Control Protocol (TCP) congestion control, that when executed by one or more processors, cause the one or more processors to perform operations comprising:

determining, by a communication device and based on a first signal received from a communication network, a first network state;

determining, by a loss predictor of the communication device and according to the first network state, a first loss prediction for a first packet, the first loss prediction indicating a likelihood that the first packet will be lost due to network congestion if the first packet is transmitted via the communication network; and

determining, by the communication device and based at least on the first loss prediction, to transmit the first packet via the communication network.
The non-transitory computer-readable media of Claim 27, wherein the operations further comprise determining, based on the first loss prediction, whether to increase or decrease a value of a congestion window variable, the value of the congestion window variable indicating a size of a congestion window.
The non-transitory computer-readable media as in any of Claims 27-28, wherein the first loss prediction comprises a first loss probability indicating an estimate of a probability that the first packet will be lost if the first packet is transmitted via the communication network.
The non-transitory computer-readable media of Claim 29, wherein determining, by the communication device and based at least on the first loss prediction, to transmit the first packet via the communication network comprises:

determining that the first loss probability does not exceed a loss probability threshold; and

determining, based at least on determining that the first loss probability does not exceed the loss probability threshold, to transmit the first packet via the communication network.
The non-transitory computer-readable media as in any of Claims 27-30, wherein the operations further comprise transmitting, in response to determining to transmit the first packet via the communication network, the first packet via the communication network.
The non-transitory computer-readable media as in any of Claims 27-31, wherein the operations further comprise:

determining, by the communication device and based on a second signal received from the communication network, a second network state;

determining, by the communication device and according to the second network state, a second loss prediction for a second packet, the second loss prediction indicating a likelihood that the second packet will be lost if the second packet is transmitted via the communication network; and

determining, by the communication device and based at least on the second loss prediction, not to transmit the second packet via the communication network.
The non-transitory computer-readable media of Claim 32, wherein:

the second loss prediction comprises a second loss probability indicating an estimate of a probability that the second packet will be lost if the second packet is transmitted via the communication network; and

determining, by the communication device and based at least on the second loss prediction, not to transmit the second packet via the communication network comprises:

determining that the second loss probability exceeds a loss probability threshold; and

determining, based at least on determining that the second loss probability exceeds the loss probability threshold, not to transmit the second packet via the communication network.
The non-transitory computer-readable media of Claim 33, wherein the operations further comprise decreasing, based at least on determining that the second loss probability exceeds the loss probability threshold, a value of a congestion window variable, the value of the congestion window variable indicating a size of a congestion window.
The non-transitory computer-readable media as in any of Claims 27-34, wherein the operations further comprise performing supervised training of the loss predictor.
The non-transitory computer-readable media of Claim 35, wherein performing supervised training of the loss predictor comprises:

collecting, through simulations performed on a network simulator, training data for training the loss predictor;

training a model according to the training data;

correlating network states to respective probabilities of loss according to the trained model.
The non-transitory computer-readable media as in any of Claims 27-36, wherein the first signal is an acknowledgement (ACK) received by the communication device in response to a previous packet transmission by the communication device.
The non-transitory computer-readable media as in any of Claims 27-37, wherein the first network state is represented as a state vector representing the congested condition of the network.
The non-transitory computer-readable media of Claim 38, wherein the state vector comprises values for:

an exponentially weighted moving average (EWMA) of acknowledgement (ACK) inter-arrival time, the first signal being an ACK;

an EWMA of packet inter-sending time;

a ratio of current round-trip time (RTT) and a minimum RTT;

a slow start threshold;

a congestion window size; or

a combination of the EWMA of ACK inter-arrival time, the EWMA of packet inter-sending time, the ratio of the current RTT and the minimum RTT, the slow start threshold, and the congestion window size.
A system for Transmission Control Protocol (TCP) congestion control, the system comprising:

means for determining, by a communication device and based on a first signal received from a communication network, a first network state;

means for determining, by a loss predictor of the communication device and according to the first network state, a first loss prediction for a first packet, the first loss prediction indicating a likelihood that the first packet will be lost due to network congestion if the first packet is transmitted via the communication network; and

means for determining, by the communication device and based at least on the first loss prediction, to transmit the first packet via the communication network.
A method for Transmission Control Protocol (TCP) congestion control, the method comprising:

determining, by a communication device and based on a first utility value and a second utility value, a reward for a first action, wherein:

the first utility value is determined using a utility function and corresponds to a first time period from a first time to a second time;

the second utility value is determined using the utility function and corresponds to a second time period from the second time to a third time; and

the first action corresponds to the first time and is one of a plurality of actions, each action of the plurality of actions comprising modifying a value of a congestion window variable for a communication network by a respective amount, the value of the congestion window variable representing a size of the congestion window;

updating, by the communication device, a first value function indicating a first desirability value associated with a first network state and the first action, the first value function updated according to the reward and a second value function indicating a second desirability value associated with a second network state and a second action of the plurality of actions, the first network state corresponding to the first time, the second network state and the second action corresponding to the second time;

determining, by the communication device and according to the updated first value function, a third action of the plurality of actions; and

updating the congestion window variable based on the third action.
The method of Claim 41, wherein:

a difference between the second time and the first time is at least a round trip time of a first packet in the communication network; and

a difference between the third time and the second time is at least a round trip time of a second packet in the communication network.
The method as in any of Claims 41-42, wherein determining, by the communication device and based on the first utility value and the second utility value, the reward is performed based at least in part on determining that a difference between the third time and the second time is greater than or equal to a round trip time.
The method as in any of Claims 41-43, further comprising updating the congestion window variable at least twice prior to updating the congestion window variable based on the third action.
The method as in any of Claims 41-44, wherein one of the following is true:

the first action is the same as the second action;

the first action is the same as the third action;

the second action is the same as the third action; and

the first action, the second action, and the third action are the same.
The method as in any of Claims 41-45, wherein one of the following is true:

the first action is different than the second action;

the first action is different than the third action;

the second action is different than the third action; and

the first action, the second action, and the third action are different.
The method as in any of Claims 41-46, wherein the first, second, and third network states are each represented as respective state vectors representing a congested condition of the communication network.
The method of Claim 47, wherein each state vector comprises values for:

an exponentially weighted moving average (EWMA) of acknowledgement (ACK) inter-arrival time;

an EWMA of packet inter-sending time;

a ratio of current round-trip time (RTT) and a minimum RTT;

a slow start threshold;

a current value of the congestion window variable; or

a combination of the EWMA of ACK inter-arrival time, the EWMA of packet inter-sending time, the ratio of the current RTT and the minimum RTT, the slow start threshold, and the current value of the congestion window variable.
The method as in any of Claims 41-48, wherein the second utility value, as determined using the utility function, is a function of throughput, delay, and packet loss rate.
The method of Claim 49, wherein the utility function is:

wherein tp is throughput, B is a bandwidth of a bottleneck in the communication network, d is a delay calculated as a difference between a current round-trip time (RTT) and a minimum RTT, p is a packet loss rate, and δ ₁ and δ ₂ are adjustable coefficients.
The method as in any of Claims 41-50, wherein, for an action of the plurality of actions, the respective amount for modifying the congestion window variable is -1, 0, +1, or +3.
The method as in any of Claims 41-51, wherein the value function is a Q-function that is trained using a State-Action-Reward-State-Action (SARSA) temporal difference learning algorithm.
The method of Claim 52, wherein the first value function is updated as follows:

wherein
means y ₁= (1-α) y ₁+ αy ₂; s _i, a _i, r _i are state, action, and reward variables for storing respective values computed at the beginning of a time period i; α _t is the learning rate as a function of time t (e.g., in seconds) ; and γ is a discount factor; and

wherein n-1 is the first time, n is the second time; and n+1 is the third time.
The method of Claim 53, wherein:

r _n+1 is computed as follows:

Δ _n+1=U _n+1-U _n; and

U _n+1 is the second utility value and U _n is the first utility value.
The method as in any of Claims 41-54, further comprising:

determining, at the first time, the first network state; and

determining, at the second time, the second network state.
The method as in any of Claims 41-55, further comprising determining, in response to receiving a transmission signal, a third network state, the signal being an acknowledgement (ACK) received by the communication device in response to a previous packet transmission by the communication device.
The method as in any of Claims 41-56, wherein the third action is further determined according to an ∈-greedy exploration-exploitation (E2) scheme.
A system for Transmission Control Protocol (TCP) congestion control, the system comprising:

a non-transitory memory storage comprising instructions; and

one or more processors in communication with the memory storage, wherein the one or more processors are configured to execute the instructions to:

determine, by a communication device and based on a first utility value and a second utility value, a reward for a first action, wherein:

the first utility value is determined using a utility function and corresponds to a first time period from a first time to a second time;

the second utility value is determined using the utility function and corresponds to a second time period from the second time to a third time; and

the first action corresponds to the first time and is one of a plurality of actions, each action of the plurality of actions comprising modifying a value of a congestion window variable for a communication network by a respective amount, the value of the congestion window variable representing a size of the congestion window;

update, by the communication device, a first value function indicating a first desirability value associated with a first network state and the first action, the first value function updated according to the reward and a second value function indicating a second desirability value associated with a second network state and a second action of the plurality of actions, the first network state corresponding to the first time, the second network state and the second action corresponding to the second time;

determine, by the communication device and according to the updated first value function, a third action of the plurality of actions; and

update the congestion window variable based on the third action.
The system of Claim 58, wherein:

a difference between the second time and the first time is at least a round trip time of a first packet in the communication network; and

a difference between the third time and the second time is at least a round trip time of a second packet in the communication network.
The system as in any of Claims 58-59, wherein determining, by the communication device and based on the first utility value and the second utility value, the reward is performed based at least in part on determining that a difference between the third time and the second time is greater than or equal to a round trip time.
The system as in any of Claims 58-60, wherein the one or more processors are further configured to execute the instructions to update the congestion window variable at least twice prior to updating the congestion window variable based on the third action.
The system as in any of Claims 58-61, wherein one of the following is true:

the first action is the same as the second action;

the first action is the same as the third action;

the second action is the same as the third action; and

the first action, the second action, and the third action are the same.
The system as in any of Claims 58-62, wherein one of the following is true:

the first action is different than the second action;

the first action is different than the third action;

the second action is different than the third action; and

the first action, the second action, and the third action are different.
The system as in any of Claims 58-63, wherein the first, second, and third network states are each represented as respective state vectors representing a congested condition of the communication network.
The system of Claim 64, wherein each state vector comprises values for:

an exponentially weighted moving average (EWMA) of acknowledgement (ACK) inter-arrival time;

an EWMA of packet inter-sending time;

a ratio of current round-trip time (RTT) and a minimum RTT;

a slow start threshold;

a current value of the congestion window variable; or

a combination of the EWMA of ACK inter-arrival time, the EWMA of packet inter-sending time, the ratio of the current RTT and the minimum RTT, the slow start threshold, and the current value of the congestion window variable.
The system as in any of Claims 58-65, wherein the second utility value, as determined using the utility function, is a function of throughput, delay, and packet loss rate.
The system of Claim 66, wherein the utility function is:

wherein tp is throughput, B is a bandwidth of a bottleneck in the communication network, d is a delay calculated as a difference between a current round-trip time (RTT) and a minimum RTT, p is a packet loss rate, and δ ₁ and δ ₂ are adjustable coefficients.
The system as in any of Claims 58-67, wherein, for an action of the plurality of actions, the respective amount for modifying the congestion window variable is -1, 0, +1, or +3.
The system as in any of Claims 58-68, wherein the value function is a Q-function that is trained using a State-Action-Reward-State-Action (SARSA) temporal difference learning algorithm.
The system of Claim 69, wherein the first value function is updated as follows:

wherein
means y ₁= (1-α) y ₁+αy ₂; s _i, a _i, r _i are state, action, and reward variables for storing respective values computed at the beginning of a time period i; α _t is the learning rate as a function of time t (e.g., in seconds) ; and γ is a discount factor; and

wherein n-1 is the first time, n is the second time; and n+1 is the third time.
The system of Claim 70, wherein:

r _n+1 is computed as follows:

Δ _n+1=U _n+1-U _n; and

U _n+1 is the second utility value and U _n is the first utility value.
The system as in any of Claims 58-71, wherein the one or more processors are further configured to execute the instructions to:

determine, at the first time, the first network state; and

determine, at the second time, the second network state.
The system as in any of Claims 58-72, wherein the one or more processors are further configured to execute the instructions to determine, in response to receiving a transmission signal, a third network state, the signal being an acknowledgement (ACK) received by the communication device in response to a previous packet transmission by the communication device.
The system as in any of Claims 58-73, wherein the third action is further determined according to an ∈-greedy exploration-exploitation (E2) scheme.
A non-transitory computer-readable media storing computer instructions for Transmission Control Protocol (TCP) congestion control, that when executed by one or more processors, cause the one or more processors to perform operations comprising:

determining, by a communication device and based on a first utility value and a second utility value, a reward for a first action, wherein:

the first utility value is determined using a utility function and corresponds to a first time period from a first time to a second time;

the second utility value is determined using the utility function and corresponds to a second time period from the second time to a third time; and

the first action corresponds to the first time and is one of a plurality of actions, each action of the plurality of actions comprising modifying a value of a congestion window variable for a communication network by a respective amount, the value of the congestion window variable representing a size of the congestion window;

updating, by the communication device, a first value function indicating a first desirability value associated with a first network state and the first action, the first value function updated according to the reward and a second value function indicating a second desirability value associated with a second network state and a second action of the plurality of actions, the first network state corresponding to the first time, the second network state and the second action corresponding to the second time;

determining, by the communication device and according to the updated first value function, a third action of the plurality of actions; and

updating the congestion window variable based on the third action.
The non-transitory computer-readable media of Claim 75, wherein:

a difference between the second time and the first time is at least a round trip time of a first packet in the communication network; and

a difference between the third time and the second time is at least a round trip time of a second packet in the communication network.
The non-transitory computer-readable media as in any of Claims 75-76, wherein determining, by the communication device and based on the first utility value and the second utility value, the reward is performed based at least in part on determining that a difference between the third time and the second time is greater than or equal to a round trip time.
The non-transitory computer-readable media as in any of Claims 75-77, wherein the operations further comprise updating the congestion window variable at least twice prior to updating the congestion window variable based on the third action.
The non-transitory computer-readable media as in any of Claims 75-78, wherein one of the following is true:

the first action is the same as the second action;

the first action is the same as the third action;

the second action is the same as the third action; and

the first action, the second action, and the third action are the same.
The non-transitory computer-readable media as in any of Claims 75-79, wherein one of the following is true:

the first action is different than the second action;

the first action is different than the third action;

the second action is different than the third action; and

the first action, the second action, and the third action are different.
The non-transitory computer-readable media as in any of Claims 75-79, wherein the first, second, and third network states are each represented as respective state vectors representing a congested condition of the communication network.
The non-transitory computer-readable media of Claim 81, wherein each state vector comprises values for:

an exponentially weighted moving average (EWMA) of acknowledgement (ACK) inter-arrival time;

an EWMA of packet inter-sending time;

a ratio of current round-trip time (RTT) and a minimum RTT;

a slow start threshold;

a current value of the congestion window variable; or

a combination of the EWMA of ACK inter-arrival time, the EWMA of packet inter-sending time, the ratio of the current RTT and the minimum RTT, the slow start threshold, and the current value of the congestion window variable.
The non-transitory computer-readable media as in any of Claims 75-82, wherein the second utility value, as determined using the utility function, is a function of throughput, delay, and packet loss rate.
The non-transitory computer-readable media of Claim 83, wherein the utility function is:

wherein tp is throughput, B is a bandwidth of a bottleneck in the communication network, d is a delay calculated as a difference between a current round-trip time (RTT) and a minimum RTT, p is a packet loss rate, and δ ₁ and δ ₂ are adjustable coefficients.
The non-transitory computer-readable media as in any of Claims 75-84, wherein, for an action of the plurality of actions, the respective amount for modifying the congestion window variable is -1, 0, +1, or +3.
The non-transitory computer-readable media as in any of Claims 75-85, wherein the value function is a Q-function that is trained using a State-Action-Reward-State-Action (SARSA) temporal difference learning algorithm.
The non-transitory computer-readable media of Claim 86, wherein the first value function is updated as follows:

wherein
means y ₁= (1-α) y ₁+αy ₂; s _i, a _i, r _i are state, action, and reward variables for storing respective values computed at the beginning of a time period i; α _t is the learning rate as a function of time t (e.g., in seconds) ; and γ is a discount factor; and

wherein n-1 is the first time, n is the second time; and n+1 is the third time.
The non-transitory computer-readable media of Claim 87, wherein:

r _n+1 is computed as follows:

Δ _n+1=U _n+1-U _n; and

U _n+1 is the second utility value and U _n is the first utility value.
The non-transitory computer-readable media as in any of Claims 75-88, wherein the operations further comprise:

determining, at the first time, the first network state; and

determining, at the second time, the second network state.
The non-transitory computer-readable media as in any of Claims 75-89, wherein the operations further comprise determining, in response to receiving a transmission signal, a third network state, the signal being an acknowledgement (ACK) received by the communication device in response to a previous packet transmission by the communication device.
The non-transitory computer-readable media as in any of Claims 75-90, wherein the third action is further determined according to an ∈-greedy exploration-exploitation (E2) scheme.
A system for Transmission Control Protocol (TCP) congestion control, the system comprising:

means for determining, by a communication device and based on a first utility value and a second utility value, a reward for a first action, wherein:

the first utility value is determined using a utility function and corresponds to a first time period from a first time to a second time;

the second utility value is determined using the utility function and corresponds to a second time period from the second time to a third time; and

the first action corresponds to the first time and is one of a plurality of actions, each action of the plurality of actions comprising modifying a value of a congestion window variable for a communication network by a respective amount, the value of the congestion window variable representing a size of the congestion window;

means for updating, by the communication device, a first value function indicating a first desirability value associated with a first network state and the first action, the first value function updated according to the reward and a second value function indicating a second desirability value associated with a second network state and a second action of the plurality of actions, the first network state corresponding to the first time, the second network state and the second action corresponding to the second time;

means for determining, by the communication device and according to the updated first value function, a third action of the plurality of actions; and

means updating the congestion window variable based on the third action.