CN116680477A - Personalized problem recommendation method based on reinforcement learning - Google Patents
- Publication number: CN116680477A (application CN202310703313.2A)
- Authority: CN (China)
- Prior art keywords: learner, model, reinforcement learning, personalized, network
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/9535 - Search customisation based on user profiles and personalisation
- G06N3/0442 - Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/047 - Probabilistic or stochastic networks
- G06N3/048 - Activation functions
- G06N3/092 - Reinforcement learning
- Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a personalized problem recommendation method based on reinforcement learning, and relates to the technical field of educational data mining. The method first obtains the learning records of a learner and estimates the learner's potential knowledge level with a knowledge tracking model; this knowledge level is used as part of the learner's features, making the learner's feature modeling more accurate. A reinforcement learning algorithm then deletes from the problem records those unsatisfactory problems that the learner selected by mistake, which improves recommendation accuracy. Finally, problems are recommended to the learner through the personalized recommendation model. The invention combines personalized recommendation, knowledge tracking and reinforcement learning, takes the learner's potential knowledge level into account, removes the influence of mistakenly selected problems in the learning process, and has important theoretical and practical application value.
Description
Technical Field
The invention relates to the technical field of educational data mining, and in particular to a personalized problem recommendation method based on reinforcement learning.
Background
The development of emerging information and communication technologies such as mobile communications, the Internet of Things, cloud computing, big data and artificial intelligence is changing the way humans think, produce, live and learn. Education is now developing toward networked, digital, personalized, ubiquitous and intelligent forms, and a large number of novel education modes such as mobile learning, ubiquitous learning, smart learning and blended learning are emerging.
In recent years, online learning has emerged as a personalized learning mode. By virtue of its convenience, openness and rich learning resources, it has attracted a large number of learners to register and use it. In the new generation of Internet-based learning environments, learning time is more flexible, learning methods are more diverse and learning resources are more abundant. Learners can arrange their own learning time, learning mode and learning resources according to their learning situation and learning goals.
However, unlike a conventional classroom, an online education platform cannot supervise and guide learners in real time, which gives rise to the problems of "information overload" and "knowledge disorientation". These problems mainly manifest as follows: when facing a large number of high-quality learning resources, learners often need a great deal of time to find the resources they are interested in, do not know how to plan their learning, and sometimes cannot complete their learning effectively even after spending a great deal of time. This can reduce learning efficiency, learning quality and learning enthusiasm, and increase the risk of learning failure. These problems have drawn the attention of many educators and researchers, and how to use computers, in place of teachers, to guide and assist learners has gradually become a popular research direction.
Solving the problem that online learners struggle to find problem resources of interest among massive learning resources, by providing a feasible personalized problem recommendation method that greatly improves learning efficiency, is an urgent need. The following three issues must be considered:
first, how to accurately construct the learner's features.
Conventional personalized recommendation models, whether matrix factorization models, recurrent neural network models or attention-mechanism models, model the learner's features from the learner's problem records when addressing problem recommendation, and do not consider the learner's performance on the practiced problems. The following situation may therefore occur: suppose learner i and learner j have essentially the same problem record but perform differently on the problems, with learner i answering most exercises correctly and learner j answering most of them incorrectly; the exercises they select at the next moment are then likely to be different.
It can be seen that building a learner's features based only on the problems the learner has done is not accurate enough. How to take the learner's potential knowledge level into account when modeling the learner is the first issue to consider.
Second, how to eliminate the influence of problems selected by mistake during the learning process.
Learners often select unsatisfactory problems, for example problems of unsuitable difficulty or category, but problem records do not include the learner's satisfaction with each problem, so these mistakenly selected problems become interference terms when modeling the learner's interest features. Although researchers have tried to distinguish the importance of problems by assigning a different attention coefficient to each of the learner's historical problems through an attention mechanism, the influence of these mistakenly selected problems still cannot be completely eliminated. How to remove the influence of mistakenly selected problems is therefore a necessary issue to consider.
Thirdly, how to accurately conduct problem recommendation.
After considering the learner's potential knowledge level and removing the influence of mistakenly selected problems, what remains is how to recommend problems to the learner accurately. Which personalized recommendation algorithm to choose is therefore an important issue to consider.
Handling the problems encountered in online education with reinforcement-learning-related algorithms is a research hotspot in current educational data mining. Combining a knowledge tracking model, a personalized recommendation model and a reinforcement learning model, taking the learner's potential knowledge level into account and removing the influence of mistakenly selected problems effectively alleviates the information overload problem in online education. Personalized problem recommendation through reinforcement learning is thus a good way to improve learners' learning efficiency in online education.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the shortcomings of the prior art, a personalized problem recommendation method based on reinforcement learning that combines a knowledge tracking model and a personalized recommendation model, so as to solve the real problem that learners in online education find it difficult to locate learning resources of interest.
In order to solve the technical problems, the invention adopts the following technical scheme:
a personalized problem recommendation method based on reinforcement learning comprises the following steps:
step 1: calculating potential knowledge level of the learner by using the knowledge tracking model, and adding the potential knowledge level into the characteristic construction of the personalized recommendation model and the state representation of the problem record modification model;
step 2: constructing and training a personalized recommendation model for problem recommendation;
step 3: designing and training a problem record modification model based on the reinforcement learning Deep Q-Learning algorithm to remove disliked or unsatisfactory problems selected by mistake during the learning process;
step 4: performing joint training on the personalized recommendation model and the problem record modification model;
step 5: modifying the problem records of the learner using the problem record modification model obtained from the joint training in step 4, and recommending problems to the learner using the personalized recommendation model obtained from the joint training in step 4, so as to obtain a problem recommendation list.
Further, in step 1 the knowledge tracking model is the deep knowledge tracking model DKT. Using a long short-term memory network LSTM, the DKT model exploits the temporal relationship in the learner's historical learning record to predict the learner's score on the next question. The DKT model first converts the learner's historical performance into one-hot vectors through one-hot encoding and feeds them into the LSTM network; features are extracted by the LSTM layer, the extracted features are passed to a hidden layer, and the prediction is produced by the output layer. The output of the DKT model represents the probability that the learner answers each problem correctly, i.e. the learner's performance on the next answer. The output of the LSTM layer is taken as the learner's potential knowledge level and is added to the feature construction of the personalized recommendation model and the state representation of the problem record modification model. The input of the DKT model is the learner's exercise record X_i = {x_1^i, x_2^i, ..., x_t^i}; the exercise record of learner i at time t is denoted x_t^i = (e_t^i, a_t^i), where e_t^i indicates the problem selected by learner i at time t and a_t^i indicates the answer result of learner i at time t. The problem record E_i = {e_1^i, e_2^i, ..., e_t^i} contains only the problems that learner i selected to learn, whereas the exercise record X_i also records learner i's answer results.
Further, the personalized recommendation model in step 2 comprises three parts: an Embedding layer, a GRU layer and a fully connected layer. The Embedding layer maps the one-hot vectors of the problems the learner has done to a low-dimensional vector space for encoding. The GRU layer is a gated recurrent unit layer, an improved recurrent neural network, used to extract the sequence features of the problem record. The fully connected layer computes, from the learner's features, the probability that the learner selects each problem, and problems are recommended to the learner according to the magnitude of these selection probabilities.
Further, the specific method of the step 2 is as follows:
step 2-1: through the Embedding layer, mapping the one-hot vector of each problem e_t^i in the problem record E_i of learner i to a low-dimensional vector space for encoding; the output is the low-dimensional vector v_t^i;
Step 2-2: extracting sequence features of problem records through the GRU layer;
the update gate of the GRU determines the amount by which the state information at the previous time and the state information at the current time continue to be transferred into the future, and the calculation formula is as follows:
wherein ,low representing problem done by learner i at time tDimension vector representation, h t-1 Hidden state information indicating time t-1, W z The weight coefficient representing the update gate, σ (·) is the sigmod activation function;
the reset gate of the GRU layer determines the amount by which the state information of the previous time is to be forgotten, and the calculation formula is as follows:
wherein ,Wr A weight coefficient representing a reset gate;
the calculation formula of the current memory content is shown as follows:
wherein ,Wh Is another weight coefficient of the reset gate, reset gate r t And hidden state information h t-1 The corresponding element product of (a) determines the information to be preserved at the previous moment, which is an operator representing the dot product of the matrix;
the final memorized calculation formula of the current time step is shown as follows:
wherein, (1-z t )*h t-1 Information representing the previous time is retained to the amount that is ultimately remembered at the current time,representing the amount of final memory of the current memory content reserved to the current moment; h finally obtained t The sequence characteristic of the problem records of the learner;
step 2-3: the probability of each problem selected by the learner is calculated according to the characteristics of the learner through the full connection layer, and the following formula is shown:
y=softmax(W j ·[K i ,h t ]+b j )
wherein ,Wj Is the weight coefficient of the full connection layer, b j Is the bias factor of the fully connected layer,is the potential knowledge level of learner i calculated by the DKT model; [ K ] i ,h t ]Is the sequence characteristic h of the learner problem record obtained by combining the potential knowledge level of the learner i with the GRU layer t Splicing; softmax (·) is an activation function, limiting the output value between 0 and 1;
step 2-4: the personalized recommendation model uses cross entropy as the loss function for training and updating, computed as:
L = -Σ_{i=1}^{M} p_i · log(q_i)
where M is the number of learners, p_i is the true probability distribution of the problem selected by learner i at the next moment, and q_i is the predicted probability distribution, given by the personalized recommendation model, of the problem selected by learner i at the next moment;
the cross entropy loss function is an index for measuring the difference between the real probability distribution p and the model predictive probability distribution q;
step 2-5: sorting the probabilities, computed by the personalized recommendation model, of learner i selecting each problem in descending order, and recommending the top K problems to learner i to form the problem recommendation list.
Further, the problem record modification model in step 3 adopts reinforcement-learning-related algorithms; its design comprises the action representation, the state representation, the reward function and the reinforcement learning algorithm, specifically as follows:
in order to delete problems that the learner dislikes or is unsatisfied with from the learning process, the action a_t of each step takes only two values: a_t = 0 means the problem is deleted from the problem record, and a_t = 1 means the problem is kept in the problem record;
the state of the learner is represented as:
S = [k_1, k_2, ..., k_N, p_1, p_2, ..., p_N]
where k_1, k_2, ..., k_N represent the potential knowledge level of the learner, which for the i-th learner is K_i = (k_1^i, k_2^i, ..., k_N^i), given by the knowledge tracking model; p_1, p_2, ..., p_N are the low-dimensional vector representation of the learner's problem record together with a position identifier whose function is to record the position being modified;
the reward function of the reinforcement learning module is given by the personalized recommendation model: it compares p(e_target | Ê_i), the probability of selecting the target problem based on the modified problem record Ê_i, with p(e_target | E_i), the probability of selecting the target problem based on the original problem record E_i, where e_target is the problem actually selected by the learner at the next moment; the reinforcement learning module adopts an episodic update strategy, obtaining the reward only after the modification of a learner's entire learning record is completed, and the reward is 0 at all other times;
the reinforcement learning algorithm is the deep Q-network algorithm DQN, which combines a neural network with the Q-learning algorithm of traditional reinforcement learning;
the reinforcement learning module takes the square of the difference between the true value and the predicted value as the loss function to train and update the parameters of the DQN model:
L(θ) = (y_t^DQN - Q_θ(s_t, a_t))^2
where Q_θ(s_t, a_t) is the predicted reward value of selecting action a_t in state s_t, computed by the prediction Q network whose network parameters are θ; y_t^DQN is the true value of the reward obtainable by selecting action a_t in state s_t, computed by the target Q network as y_t^DQN = r_t + γ · max_{a'} Q_{θ'}(s_{t+1}, a'), where max_{a'} Q_{θ'}(s_{t+1}, a') is the maximum reward value obtainable in the next state s_{t+1}, θ' are the network parameters of the target Q network, γ is the discount factor, and r_t is the currently obtainable reward value given by the reward function;
the gradient of the loss function is:
∇_θ L(θ) = -2 · (y_t^DQN - Q_θ(s_t, a_t)) · ∇_θ Q_θ(s_t, a_t)
and the network parameters are updated by gradient descent.
Further, the specific procedure for modifying the learner's problem record in step 3 is as follows:
step 3-1: initializing the model, including the parameters of the prediction Q network and the target Q network; initializing the experience replay pool with capacity N; initializing the set Ê of learner-modified problem records as empty, the learner index i = 1 and the time t = 0;
step 3-2: obtaining the problem record E_i of learner i and the initial state s_0;
step 3-3: taking the feature vector φ(s_t) of state s_t as the input of the prediction Q network and obtaining the Q values corresponding to the actions in the current state;
step 3-4: selecting the action a_t from the current Q values using the ε-greedy strategy;
step 3-5: if a_t = 0, deleting the problem e_t^i at the current position from the problem record E_i;
step 3-6: in state s_t, executing the current action a_t to obtain the next state s_{t+1} and the reward r_t;
step 3-7: storing the quadruple {s_t, a_t, r_t, s_{t+1}} in the experience replay pool;
step 3-8: updating the state s_t = s_{t+1};
step 3-9: sampling m samples {s_j, a_j, r_j, s_{j+1}}, j = 1, 2, ..., m, from the experience replay pool and computing the current target Q value y_j = r_j + γ · max_{a'} Q_{θ'}(φ(s_{j+1}), a');
step 3-10: updating the parameters of the prediction Q network using the mean square error loss function (1/m) Σ_{j=1}^{m} (y_j - Q_θ(φ(s_j), a_j))^2;
step 3-11: every C steps, updating the parameters of the target Q network to the current parameters of the prediction Q network;
step 3-12: judging whether the time has reached the set value T; if not, returning to step 3-3; if so, executing the next step;
step 3-13: denoting the modified problem record E_i as Ê_i and adding Ê_i to the set Ê of learner-modified problem records;
step 3-14: judging whether the problem records of all learners have been modified; if not, returning to step 3-2 and continuing with the modification of the next learner's problem record; if so, ending this step.
Further, the joint training process of step 4 is specifically as follows:
step 4-1: initializing the parameters α = α_0 of the personalized recommendation model, the parameters β = β_0 of the knowledge tracking model and the parameters θ = θ_0 of the reinforcement learning module;
step 4-2: training the knowledge tracking model using the learners' exercise records X;
step 4-3: training the personalized recommendation model using the learners' problem records E and the knowledge tracking model;
step 4-4: fixing the parameters α = α_1 of the personalized recommendation model and the parameters β = β_1 of the knowledge tracking model, and pre-training the reinforcement learning module; the specific method is:
step 4-4-1: the reinforcement learning algorithm selects an action at each step of the problem record E_i;
step 4-4-2: computing the reward function Reward according to the selected action;
step 4-4-3: updating the parameters of the reinforcement learning module according to the loss function of the Deep Q-Learning algorithm;
step 4-4-4: executing steps 4-4-1 to 4-4-3 in a loop until the whole problem record E_i has been traversed;
step 4-4-5: repeating steps 4-4-1 to 4-4-4 until the parameters of the reinforcement learning module reach their optimum;
step 4-5: fixing the parameters β = β_1 of the knowledge tracking model and jointly training the personalized recommendation model and the reinforcement learning module; the specific method is:
step 4-5-1: the reinforcement learning algorithm selects an action at each step of the problem record E_i;
step 4-5-2: computing the reward function Reward according to the selected action;
step 4-5-3: updating the parameters of the reinforcement learning module according to the loss function of the Deep Q-Learning algorithm;
step 4-5-4: executing steps 4-5-1 to 4-5-3 in a loop until the whole problem record E_i has been traversed;
step 4-5-5: updating the parameters of the recommendation model according to the loss function of the recommendation model;
step 4-5-6: repeatedly executing steps 4-5-1 to 4-5-5 in a loop until the parameters of the personalized recommendation model and the reinforcement learning module reach their optimum.
The beneficial effects of the above technical solution are as follows. In the personalized problem recommendation method based on reinforcement learning, the learning record of the learner is first obtained, the learner's potential knowledge level is estimated by the knowledge tracking model, and this potential knowledge level is used as part of the learner's features, making the learner's feature modeling more accurate. The method then deletes, through a reinforcement learning algorithm, the unsatisfactory problems that the learner selected by mistake from the problem records, which improves recommendation accuracy. Finally, problems are recommended to the learner through the personalized recommendation model. The method combines personalized recommendation, knowledge tracking and reinforcement learning algorithms, takes the learner's potential knowledge level into account, removes the influence of mistakenly selected problems in the learning process, and has important theoretical and practical application value.
Drawings
FIG. 1 is a diagram of a personalized problem recommendation model provided by an embodiment of the present invention;
FIG. 2 is a flowchart of a personalized problem recommendation method based on reinforcement learning according to an embodiment of the present invention;
FIG. 3 is a block diagram of the knowledge tracking model DKT provided by an embodiment of the present invention;
FIG. 4 is a block diagram of the long short-term memory network LSTM provided by an embodiment of the present invention;
FIG. 5 is a block diagram of the personalized recommendation model provided by an embodiment of the present invention;
FIG. 6 is a block diagram of the deep Q network DQN provided by an embodiment of the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
This embodiment provides a personalized problem recommendation method based on reinforcement learning. As shown in FIG. 1, the model constructed by the method of this embodiment consists of three parts: a knowledge tracking model, a personalized recommendation model and a problem record modification model. The knowledge tracking model computes the learner's potential knowledge level and adds it to the feature construction of the personalized recommendation model and the state representation of the problem record modification model. The personalized recommendation model provides the reward function for the problem record modification model and recommends problems to learners. The problem record modification model modifies the learner's historical problem record and evaluates and updates the modifications according to the reward function provided by the personalized recommendation model, thereby improving the accuracy of problem recommendation. The flow of the method is shown in FIG. 2; the specific steps are as follows.
Step 1: compute the learner's potential knowledge level with the knowledge tracking model and add it to the feature construction of the personalized recommendation model and the state representation of the problem record modification model.
The knowledge tracking model employed in this embodiment is the deep knowledge tracking model (Deep Knowledge Tracing, DKT). The DKT model predicts the learner's performance on the next question from the learner's historical learning record by exploiting the temporal relationship through a recurrent neural network or a long short-term memory network LSTM; the recurrent network used in this embodiment is the LSTM. The DKT model first converts the learner's historical performance into one-hot vectors through one-hot encoding and feeds them into the LSTM network; features are extracted by the LSTM layer and passed to a hidden layer, and the prediction is produced by the output layer. The output of DKT represents the probability that the learner answers each problem correctly, i.e. the learner's performance on the next answer.
The structure of the DKT model is shown in FIG. 3. The model is a knowledge tracking model based on a Long Short-Term Memory (LSTM) network, and the learner's potential knowledge level can be determined from the learner's performance in the learning record. The input of the DKT model is the exercise record of learner i, X_i = {x_1^i, x_2^i, ..., x_t^i}; the exercise record of learner i at time t is denoted x_t^i = (e_t^i, a_t^i), where e_t^i is the number of the problem selected by learner i at time t and a_t^i is learner i's performance on that problem at time t, with 1 indicating the problem was answered correctly and 0 indicating it was answered incorrectly. Each x_t^i is first converted into a one-hot vector through one-hot encoding and then input into the LSTM network.
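As an illustration of this input encoding, the sketch below builds the one-hot vector for a single interaction. It assumes the common DKT convention of a 2M-dimensional vector for M distinct problems, with the hot index depending on both the problem id and the correctness; the function name and this index scheme are illustrative assumptions, not values taken from the patent text.

```python
import numpy as np

def encode_interaction(problem_id: int, correct: int, num_problems: int) -> np.ndarray:
    """One-hot encode a single exercise record x_t = (e_t, a_t).

    Assumed convention: a 2*M-dimensional vector where index e_t encodes an
    incorrect answer and index M + e_t encodes a correct answer.
    """
    x = np.zeros(2 * num_problems, dtype=np.float32)
    x[problem_id + correct * num_problems] = 1.0
    return x

# Example: learner answered problem 3 correctly, out of M = 10 problems.
print(encode_interaction(problem_id=3, correct=1, num_problems=10))
```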
The LSTM network is an improved recurrent neural network that solves the problem that ordinary RNNs cannot handle long-distance dependencies; the LSTM structure is shown in FIG. 4.
Unlike an ordinary recurrent neural network, the long short-term memory network introduces a memory cell state and controls the stored information through three gating units, so that the cell state of the neuron retains information over the whole long sequence.
The forget gate in the LSTM network controls how much of the state at the previous time is retained, and is computed as:
f_t = σ(W_f · [h_{t-1}, x_t^i] + b_f)
where W_f is the weight matrix of the forget gate, x_t^i is the input of the forget gate at time t, here the one-hot encoded exercise record of learner i at time t, [h_{t-1}, x_t^i] denotes the concatenation of the two vectors, h_{t-1} is the output at time t-1, b_f is the bias term of the forget gate, and σ(·) is the sigmoid activation function.
The input gate in the LSTM network controls how much of the current input enters the long-term state, and is computed as:
I_t = σ(W_I · [h_{t-1}, x_t^i] + b_I)
where W_I is the weight matrix of the input gate and b_I is the bias term of the input gate.
The cell state of the current input is:
C̃_t = tanh(W_c · [h_{t-1}, x_t^i] + b_c)
where W_c is the weight matrix of the cell state, b_c is the bias term of the cell state, and tanh is the activation function.
Combining the above three formulas with the cell state C_{t-1} at the previous time, the cell state at the current time is obtained as:
C_t = f_t * C_{t-1} + I_t * C̃_t
where * is the operator denoting the element-wise product.
The output gate in the LSTM network controls whether the long-term state is taken as the current output, and is expressed as:
o_t = σ(W_o · [h_{t-1}, x_t^i] + b_o)
where W_o is the weight matrix of the output gate and b_o is the bias term of the output gate.
Finally, the output state is obtained by:
h_t = o_t * tanh(C_t)
the DKT model can comprehensively consider the exercise performance of the learner for a long time and the recent exercise performance, therebyThe potential knowledge level of the learner is determined. And wherein the design of the forgetting gate conforms to the feature that the learner will decrease over time, with a gradual decrease in the level of mastery of previously learned knowledge. The present embodiment marks the output of the LSTM layer as the knowledge level at the potential N knowledge points of learner i asWhich is used as part of the learner's profile to enhance the performance of the recommendation.
Step 2: build and train the personalized recommendation model, which comprises three parts: an Embedding layer, a GRU layer and a fully connected layer. The Embedding layer maps the one-hot vectors of the problems done by learner i to a low-dimensional vector space for encoding. The GRU layer is a gated recurrent unit layer, an improved recurrent neural network, used to extract the sequence features of the problem record. The fully connected layer computes, from the features of learner i, the probability that the learner selects each problem, and problems are recommended to the learner according to the magnitude of these selection probabilities. The personalized recommendation model has two functions: first, it provides the reward function for the problem record modification model; second, it recommends problems for learners. The structure of the personalized recommendation model is shown in FIG. 5; the specific steps are as follows.
Step 2-1: through the Embedding layer, map the one-hot vector of each problem e_t^i in the problem record E_i of learner i to a low-dimensional vector space for encoding; the output is the low-dimensional vector v_t^i.
Step 2-2: and extracting sequence features of the problem records through the GRU layer.
The GRU layer has only two gates, an update gate and a reset gate. The GRU layer computes the outputs of the reset gate and the update gate from the input at the current time and the hidden state of the network at the previous time, computes the candidate hidden state from the current input and the output of the reset gate, obtains the final hidden state from the candidate hidden state and the output of the update gate, and obtains the output at the current time from the hidden state.
The update gate of the GRU determines how much of the state information at the previous time and at the current time continues to be passed into the future, and is computed as:
z_t = σ(W_z · [h_{t-1}, v_t^i])
where v_t^i is the low-dimensional vector representation of the problem done by learner i at time t, h_{t-1} is the hidden state information at time t-1, W_z is the weight coefficient of the update gate, and σ(·) is the sigmoid activation function.
The reset gate of the GRU layer determines how much of the state information of the previous time is to be forgotten, and is computed as:
r_t = σ(W_r · [h_{t-1}, v_t^i])
where W_r is the weight coefficient of the reset gate.
The current memory content is computed as:
h̃_t = tanh(W_h · [r_t * h_{t-1}, v_t^i])
where W_h is the weight coefficient of the reset-gate branch; the element-wise product of the reset gate r_t and the hidden state information h_{t-1} determines the information retained from the previous time, and * is the operator denoting the element-wise product.
The final memory of the current time step is computed as:
h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t
where (1 - z_t) * h_{t-1} represents how much of the information from the previous time is retained in the final memory at the current time, and z_t * h̃_t represents how much of the current memory content is retained in the final memory at the current time. The resulting h_t is the sequence feature of the learner's problem record.
Step 2-3: through the fully connected layer, compute the probability that the learner selects each problem from the learner's features, as follows:
y = softmax(W_j · [K_i, h_t] + b_j)
where W_j is the weight coefficient of the fully connected layer, b_j is the bias coefficient of the fully connected layer, and K_i = (k_1^i, k_2^i, ..., k_N^i) is the potential knowledge level of learner i computed by the DKT model; [K_i, h_t] is the concatenation of the potential knowledge level of learner i with the sequence feature h_t of the learner's problem record obtained by the GRU layer; softmax(·) is an activation function that limits the output values to between 0 and 1.
Step 2-4: the personalized recommendation model uses cross entropy as the loss function for training and updating, computed as:
L = -Σ_{i=1}^{M} p_i · log(q_i)
where M is the number of learners, p_i is the true probability distribution of the problem selected by learner i at the next moment, and q_i is the predicted probability distribution, given by the personalized recommendation model, of the problem selected by learner i at the next moment. The cross entropy loss function measures the difference between the true probability distribution p and the model's predicted probability distribution q.
Step 2-5: sort the probabilities, computed by the personalized recommendation model, of learner i selecting each problem in descending order, and recommend the top K problems to learner i to form the problem recommendation list.
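As a concrete reading of steps 2-1 to 2-5, the following is a minimal PyTorch sketch of the Embedding, GRU and fully connected recommendation model, with the DKT knowledge level concatenated before the final layer. The class name, layer sizes and the top-K helper are illustrative assumptions, not taken from the patent text.

```python
import torch
import torch.nn as nn

class PersonalizedRecommender(nn.Module):
    """Sketch of the Embedding -> GRU -> fully connected recommendation model."""

    def __init__(self, num_problems: int, embed_dim: int = 64,
                 hidden_size: int = 128, knowledge_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(num_problems, embed_dim)          # step 2-1
        self.gru = nn.GRU(embed_dim, hidden_size, batch_first=True)     # step 2-2
        self.fc = nn.Linear(hidden_size + knowledge_dim, num_problems)  # step 2-3

    def forward(self, problem_ids, knowledge_level):
        # problem_ids: (batch, seq_len) problem record E_i
        # knowledge_level: (batch, knowledge_dim) potential knowledge level K_i from DKT
        v = self.embedding(problem_ids)          # low-dimensional vectors v_t^i
        h, _ = self.gru(v)
        h_t = h[:, -1, :]                        # sequence feature of the problem record
        logits = self.fc(torch.cat([knowledge_level, h_t], dim=-1))
        return torch.softmax(logits, dim=-1)     # probability of selecting each problem

def recommend_top_k(probs: torch.Tensor, k: int = 10):
    """Step 2-5: rank problems by selection probability and return the top K."""
    return torch.topk(probs, k, dim=-1).indices

# Usage sketch: 4 learners, problem records of length 20, M = 50 problems.
model = PersonalizedRecommender(num_problems=50)
probs = model(torch.randint(0, 50, (4, 20)), torch.zeros(4, 128))
print(recommend_top_k(probs, k=5))
```

Training this sketch with the cross-entropy loss of step 2-4 would compare these output probabilities against the problem the learner actually selected at the next moment.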
Step 3: build and train the problem record modification model to remove disliked or unsatisfactory problems that the learner selected by mistake during the learning process, so that problems can be recommended to the learner more accurately. Since the problem record modification model adopts reinforcement-learning-related algorithms, its action representation, state representation, reward function and reinforcement learning algorithm are described in detail following the general development flow of reinforcement learning.
(1) Action representation
The problem record modification model is used to delete the problems that the learner dislikes or is unsatisfied with, so the action a_t of each step takes only two values: a_t = 0 means the problem is deleted from the problem record, and a_t = 1 means the problem is retained in the problem record.
(2) State representation
The status of the learner is represented by the following formula:
S = [k_1, k_2, ..., k_N, p_1, p_2, ..., p_N]
where k_1, k_2, ..., k_N represent the potential knowledge level of the learner, given by the knowledge tracking model; p_1, p_2, ..., p_N are the low-dimensional vector representation of the learner's problem record together with a position identifier whose function is to record the position being modified.
(3) Reward function
The reward function of the reinforcement learning module is given by the personalized recommendation model: it compares p(e_target | Ê_i), the probability of selecting the target problem based on the modified problem record Ê_i, with p(e_target | E_i), the probability of selecting the target problem based on the original problem record E_i, where e_target is the problem actually selected by the learner at the next moment. The reinforcement learning module adopts an episodic update strategy: the reward is obtained only after the modification of a learner's entire learning record is completed, and is 0 at all other times.
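The sketch below illustrates one way such a reward could be computed from the recommendation model, rewarding modifications that raise the probability of the problem the learner actually chose next. The difference form and the function signature are illustrative assumptions, since the patent text only states that the two probabilities are compared.

```python
import torch

def episode_reward(model, original_record, modified_record, knowledge_level,
                   target_problem: int) -> float:
    """Reward for one finished episode of problem-record modification.

    Assumption: reward = p(e_target | modified record) - p(e_target | original record),
    i.e. positive when the modification makes the true next problem more likely.
    `model` is a recommender like the sketch above, returning per-problem probabilities.
    """
    with torch.no_grad():
        p_modified = model(modified_record, knowledge_level)[0, target_problem]
        p_original = model(original_record, knowledge_level)[0, target_problem]
    return float(p_modified - p_original)

# Per-step rewards within the episode are 0; only the final step receives this value.
```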
(4) Reinforcement learning algorithm
This embodiment adopts the Deep Q-Network (DQN) algorithm, which combines a neural network with the Q-learning algorithm of traditional reinforcement learning. The structure of the DQN is shown in FIG. 6.
The reinforcement learning module takes the square of the difference between the true value and the predicted value as the loss function to train and update the parameters of the DQN model:
L(θ) = (y_t^DQN - Q_θ(s_t, a_t))^2
where Q_θ(s_t, a_t) is the predicted reward value of selecting action a_t in state s_t, computed by the prediction Q network whose network parameters are θ, and y_t^DQN is the true value of the reward obtainable by selecting action a_t in state s_t, computed by the target Q network as y_t^DQN = r_t + γ · max_{a'} Q_{θ'}(s_{t+1}, a'), where max_{a'} Q_{θ'}(s_{t+1}, a') is the maximum reward value obtainable in the next state s_{t+1}, θ' are the network parameters of the target Q network, γ is the discount factor, and r_t is the currently obtainable reward value given by the reward function.
The gradient of the loss function is
∇_θ L(θ) = -2 · (y_t^DQN - Q_θ(s_t, a_t)) · ∇_θ Q_θ(s_t, a_t)
and the network parameters are updated by gradient descent.
The specific procedure of the learner problem record modification is as follows:
Step 3-1: initialize the model, including the parameters of the prediction Q network and the target Q network; initialize the experience replay pool with capacity N; initialize the set Ê of learner-modified problem records as empty, the learner index i = 1 and the time t = 0;
Step 3-2: obtain the problem record E_i of learner i and the initial state s_0;
Step 3-3: take the feature vector φ(s_t) of state s_t as the input of the prediction Q network and obtain the Q values corresponding to the actions in the current state;
Step 3-4: select the action a_t from the current Q values using the ε-greedy strategy;
Step 3-5: if a_t = 0, delete the problem e_t^i at the current position from the problem record E_i;
Step 3-6: in state s_t, execute the current action a_t to obtain the next state s_{t+1} and the reward r_t;
Step 3-7: store the quadruple {s_t, a_t, r_t, s_{t+1}} in the experience replay pool;
Step 3-8: update the state s_t = s_{t+1};
Step 3-9: sample m samples {s_j, a_j, r_j, s_{j+1}}, j = 1, 2, ..., m, from the experience replay pool and compute the current target Q value y_j = r_j + γ · max_{a'} Q_{θ'}(φ(s_{j+1}), a');
Step 3-10: update the parameters of the prediction Q network using the mean square error loss function (1/m) Σ_{j=1}^{m} (y_j - Q_θ(φ(s_j), a_j))^2;
Step 3-11: every C steps, update the parameters of the target Q network to the current parameters of the prediction Q network;
Step 3-12: judge whether the time has reached the set value T; if not, return to Step 3-3; if so, execute the next step;
Step 3-13: denote the modified problem record E_i as Ê_i and add Ê_i to the set Ê of learner-modified problem records;
Step 3-14: judge whether the problem records of all learners have been modified; if not, return to Step 3-2 and continue with the modification of the next learner's problem record; if so, end this step.
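The following PyTorch sketch ties Steps 3-1 to 3-13 together for a single learner's episode: a small prediction/target Q-network pair, ε-greedy selection over the keep/delete actions, an experience replay buffer, and the mean-square-error update. Network sizes, hyperparameters and the environment helper `env` are illustrative assumptions, not values from the patent.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Prediction / target Q network over the state feature vector phi(s_t)."""
    def __init__(self, state_dim: int, num_actions: int = 2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_actions))

    def forward(self, phi_s):
        return self.net(phi_s)

def modify_record_episode(env, q_net, target_net, buffer: deque, optimizer,
                          horizon: int, gamma: float = 0.9, eps: float = 0.1,
                          batch_size: int = 32, sync_every: int = 50):
    """One episode of problem-record modification for a single learner.

    `env` is an assumed helper exposing reset() -> s_0, step(a_t) -> (s_{t+1}, r_t)
    and modified_record(); a_t = 0 deletes the current problem, a_t = 1 keeps it.
    """
    s = env.reset()                                         # Step 3-2
    for t in range(horizon):                                # until time T (Step 3-12)
        q_values = q_net(s)                                 # Step 3-3
        if random.random() < eps:                           # Step 3-4: epsilon-greedy
            a = random.randrange(2)
        else:
            a = int(torch.argmax(q_values).item())
        s_next, r = env.step(a)                             # Steps 3-5 / 3-6
        buffer.append((s, a, r, s_next))                    # Step 3-7
        s = s_next                                          # Step 3-8

        if len(buffer) >= batch_size:                       # Step 3-9
            batch = random.sample(list(buffer), batch_size)
            states = torch.stack([b[0] for b in batch])
            actions = torch.tensor([b[1] for b in batch])
            rewards = torch.tensor([b[2] for b in batch], dtype=torch.float32)
            next_states = torch.stack([b[3] for b in batch])
            with torch.no_grad():
                y = rewards + gamma * target_net(next_states).max(dim=1).values
            q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            loss = ((y - q) ** 2).mean()                    # Step 3-10: MSE loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if t % sync_every == 0:                             # Step 3-11
            target_net.load_state_dict(q_net.state_dict())
    return env.modified_record()                            # Step 3-13

# Usage sketch (with an assumed env): q = QNet(state_dim); tgt = QNet(state_dim)
# tgt.load_state_dict(q.state_dict()); opt = torch.optim.Adam(q.parameters(), lr=1e-3)
# buf = deque(maxlen=10000); modified = modify_record_episode(env, q, tgt, buf, opt, horizon=100)
```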
Step 4: jointly train the personalized recommendation model and the problem record modification model to obtain optimal model parameters and improve the accuracy of problem recommendation. The joint training process of the personalized problem recommendation model based on the reinforcement learning algorithm provided in this embodiment is as follows.
Step 4-1: initialize the parameters α = α_0 of the personalized recommendation model, the parameters β = β_0 of the knowledge tracking model and the parameters θ = θ_0 of the reinforcement learning module;
Step 4-2: train the knowledge tracking model using the learners' exercise records X;
Step 4-3: train the personalized recommendation model using the learners' problem records E and the knowledge tracking model;
Step 4-4: fix the parameters α = α_1 of the personalized recommendation model and the parameters β = β_1 of the knowledge tracking model, and pre-train the reinforcement learning module; the specific method is:
Step 4-4-1: the reinforcement learning algorithm selects an action at each step of the problem record E_i;
Step 4-4-2: compute the reward function Reward according to the selected action;
Step 4-4-3: update the parameters of the reinforcement learning module according to the loss function of the Deep Q-Learning algorithm;
Step 4-4-4: execute Steps 4-4-1 to 4-4-3 in a loop until the whole problem record E_i has been traversed;
Step 4-4-5: repeat Steps 4-4-1 to 4-4-4 until the parameters of the reinforcement learning module reach their optimum;
Step 4-5: fix the parameters β = β_1 of the knowledge tracking model and jointly train the personalized recommendation model and the reinforcement learning module; the specific method is:
Step 4-5-1: the reinforcement learning algorithm selects an action at each step of the problem record E_i;
Step 4-5-2: compute the reward function Reward according to the selected action;
Step 4-5-3: update the parameters of the reinforcement learning module according to the loss function of the Deep Q-Learning algorithm;
Step 4-5-4: execute Steps 4-5-1 to 4-5-3 in a loop until the whole problem record E_i has been traversed;
Step 4-5-5: update the parameters of the recommendation model according to the loss function of the recommendation model;
Step 4-5-6: repeatedly execute Steps 4-5-1 to 4-5-5 in a loop until the parameters of the personalized recommendation model and the reinforcement learning module reach their optimum.
Step 5: modify the learner's problem records using the problem record modification model obtained from the joint training in Step 4, and recommend problems to the learner using the personalized recommendation model obtained from the joint training in Step 4, so as to obtain the problem recommendation list.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, which are defined by the scope of the appended claims.
Claims (7)
1. A personalized problem recommendation method based on reinforcement learning is characterized in that: the method comprises the following steps:
step 1: calculating potential knowledge level of the learner by using the knowledge tracking model, and adding the potential knowledge level into the characteristic construction of the personalized recommendation model and the state representation of the problem record modification model;
step 2: constructing and training a personalized recommendation model for problem recommendation;
step 3: designing and training a problem record modification model based on the reinforcement learning Deep Q-Learning algorithm to remove disliked or unsatisfactory problems selected by mistake during the learning process;
step 4: performing joint training on the personalized recommendation model and the problem record modification model;
step 5: modifying the problem records of the learner using the problem record modification model obtained from the joint training in step 4, and recommending problems to the learner using the personalized recommendation model obtained from the joint training in step 4, so as to obtain a problem recommendation list.
2. The reinforcement learning-based personalized problem recommendation method of claim 1, wherein: in step 1, the knowledge tracking model is the deep knowledge tracking model DKT; using a long short-term memory network LSTM, the DKT model exploits the temporal relationship in the learner's historical learning record to predict the learner's score on the next question; the DKT model first converts the learner's historical performance into one-hot vectors through one-hot encoding and feeds them into the LSTM network, features are extracted by the LSTM layer, the extracted features are passed to a hidden layer, and the prediction is produced by the output layer; the output of the DKT model represents the probability that the learner answers each problem correctly, i.e. the learner's performance on the next answer; the output of the LSTM layer is taken as the learner's potential knowledge level and is added to the feature construction of the personalized recommendation model and the state representation of the problem record modification model; the input of the DKT model is the learner's exercise record X_i = {x_1^i, x_2^i, ..., x_t^i}, where the exercise record of learner i at time t is denoted x_t^i = (e_t^i, a_t^i), e_t^i indicates the problem selected by learner i at time t, and a_t^i indicates the answer result of learner i at time t; the problem record E_i = {e_1^i, e_2^i, ..., e_t^i} contains only the problems that learner i selected to learn, whereas the exercise record X_i also records learner i's answer results.
3. The reinforcement learning-based personalized problem recommendation method of claim 1, wherein: the personalized recommendation model in step 2 comprises three parts: an Embedding layer, a GRU layer and a fully connected layer; the Embedding layer maps the one-hot vectors of the problems the learner has done to a low-dimensional vector space for encoding; the GRU layer is a gated recurrent unit layer, an improved recurrent neural network, used to extract the sequence features of the problem record; the fully connected layer computes, from the learner's features, the probability that the learner selects each problem, and problems are recommended to the learner according to the magnitude of these selection probabilities.
4. The reinforcement learning-based personalized problem recommendation method of claim 3, wherein: the specific method of step 2 is as follows:
step 2-1: through the Embedding layer, mapping the one-hot vector of each problem e_t^i in the problem record E_i of learner i to a low-dimensional vector space for encoding; the output is the low-dimensional vector v_t^i;
step 2-2: extracting the sequence features of the problem record through the GRU layer;
the update gate of the GRU determines how much of the state information at the previous time and at the current time continues to be passed into the future, and is computed as:
z_t = σ(W_z · [h_{t-1}, v_t^i])
where v_t^i is the low-dimensional vector representation of the problem done by learner i at time t, h_{t-1} is the hidden state information at time t-1, W_z is the weight coefficient of the update gate, and σ(·) is the sigmoid activation function;
the reset gate of the GRU layer determines how much of the state information of the previous time is to be forgotten, and is computed as:
r_t = σ(W_r · [h_{t-1}, v_t^i])
where W_r is the weight coefficient of the reset gate;
the current memory content is computed as:
h̃_t = tanh(W_h · [r_t * h_{t-1}, v_t^i])
where W_h is another weight coefficient of the reset gate; the element-wise product of the reset gate r_t and the hidden state information h_{t-1} determines the information retained from the previous time, and * is the operator denoting the element-wise product;
the final memory of the current time step is computed as:
h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t
where (1 - z_t) * h_{t-1} represents how much of the information from the previous time is retained in the final memory at the current time, and z_t * h̃_t represents how much of the current memory content is retained in the final memory at the current time; the resulting h_t is the sequence feature of the learner's problem record;
step 2-3: computing, through the fully connected layer, the probability that the learner selects each problem from the learner's features, as follows:
y = softmax(W_j · [K_i, h_t] + b_j)
where W_j is the weight coefficient of the fully connected layer, b_j is the bias coefficient of the fully connected layer, and K_i = (k_1^i, k_2^i, ..., k_N^i) is the potential knowledge level of learner i computed by the DKT model; [K_i, h_t] is the concatenation of the potential knowledge level of learner i with the sequence feature h_t of the learner's problem record obtained by the GRU layer; softmax(·) is an activation function that limits the output values to between 0 and 1;
step 2-4: the personalized recommendation model uses cross entropy as the loss function for training and updating, computed as:
L = -Σ_{i=1}^{M} p_i · log(q_i)
where M is the number of learners, p_i is the true probability distribution of the problem selected by learner i at the next moment, and q_i is the predicted probability distribution, given by the personalized recommendation model, of the problem selected by learner i at the next moment;
the cross entropy loss function measures the difference between the true probability distribution p and the model's predicted probability distribution q;
step 2-5: sorting the probabilities, computed by the personalized recommendation model, of learner i selecting each problem in descending order, and recommending the top K problems to learner i to form the problem recommendation list.
5. The reinforcement learning-based personalized problem recommendation method of claim 4, wherein: the problem record modification model in the step 3 adopts a reinforcement learning related algorithm, comprising action representation, state representation, rewarding function of the model and reinforcement learning algorithm, and specifically comprises the following steps:
in order to delete problems that are disliked or unsatisfied in the learning process of the learner, the action a of each step t With only two values, a t =0 means that the problem is deleted in the problem record, a t =1 means that the problem is retained in the problem record;
the status of the learner is represented by the following formula:
S=[k 1 ,k 2 ,…,k N ,p 1 ,p 2 ,…,p N ]
wherein ,k1 ,k 2 ,…,k N Representing potential knowledge levels of learners, specifically to the ith learner as Given by a knowledge tracking model; p is p 1 ,p 2 ,…,p N Is a low-dimensional vector representation of learner problem records and location identifiers that function to record the location of modifications;
the reward function of the reinforcement learning module is given by the personalized recommendation model and is defined in terms of p(e_target | Ê_i) and p(e_target | E_i), wherein e_target is the problem actually selected by the learner at the next moment, p(e_target | Ê_i) represents the probability of selecting the target problem based on the modified problem record, and p(e_target | E_i) represents the probability of selecting the target problem based on the original problem record; the reinforcement learning module adopts an episodic (round-update) strategy: the reward is obtained only after the modification of one learner's entire learning record is completed, and is 0 at all other times;
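The exact functional form of the reward is not reproduced here; a common choice consistent with the description is a log-ratio of the two probabilities, sketched below purely as an assumption (positive when the modified record raises the probability of the problem the learner actually chose):

```python
import math

def episode_reward(p_target_modified, p_target_original, eps=1e-12):
    """End-of-episode reward; intermediate steps receive 0 under the round-update strategy."""
    return math.log(p_target_modified + eps) - math.log(p_target_original + eps)
```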
the reinforcement learning algorithm adopts the deep Q-network algorithm DQN, which combines a neural network with the Q-Learning algorithm from traditional reinforcement learning;
the reinforcement learning module takes the square of the difference between the true value and the predicted value as the loss function to train and update the parameters of the DQN model; the specific formula of the loss function is as follows:

L(θ) = (y_t - Q_θ(s_t, a_t))²

wherein Q_θ(s_t, a_t) represents the predicted value of the reward obtained by selecting action a_t in state s_t, calculated by the prediction Q network, whose network parameters are θ; y_t = r_t + γ · max_{a'} Q_{θ'}(s_{t+1}, a') represents the true value of the reward obtainable by selecting action a_t in state s_t, where max_{a'} Q_{θ'}(s_{t+1}, a') is calculated by the target Q network and represents the maximum reward value obtainable in the next state s_{t+1}, the network parameters of the target Q network are θ', γ is the discount factor, and r_t is the currently available reward value, given by the reward function;
the gradient of the loss function is as follows:

∇_θ L(θ) = -2 · (y_t - Q_θ(s_t, a_t)) · ∇_θ Q_θ(s_t, a_t)

and the network parameters are updated according to gradient descent.
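A minimal PyTorch sketch of this loss, assuming pred_net and target_net map a batch of states to per-action Q values and that gamma is the discount factor (illustrative only):

```python
import torch
import torch.nn.functional as F

def dqn_loss(pred_net, target_net, s_t, a_t, r_t, s_next, gamma=0.99):
    """Squared error between the target value y_t and the predicted Q value."""
    q_pred = pred_net(s_t).gather(1, a_t.unsqueeze(1)).squeeze(1)    # Q_theta(s_t, a_t)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values                # max_a' Q_theta'(s_{t+1}, a')
        y_t = r_t + gamma * q_next                                   # target ("true") value
    return F.mse_loss(q_pred, y_t)                                   # minimized by gradient descent on theta
```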
6. The reinforcement learning-based personalized problem recommendation method of claim 5, wherein: the specific process of modifying the learners' problem records in step 3 is as follows:
step 3-1: initialize the model, including the parameters of the prediction Q network and the target Q network; initialize the experience replay pool with capacity N; initialize the set of learner-modified problem records as empty; set the learner index i = 1 and the time t = 0;
step 3-2: obtain the problem record E_i of learner i and the initial state s_0;
step 3-3: take the feature vector φ(s_t) of state s_t as the input of the prediction Q network, and obtain the Q value corresponding to each action in the current state;
step 3-4: select action a_t from the current Q values using the ε-greedy strategy;
step 3-5: if a_t = 0, delete the current problem from the problem record E_i of learner i;
step 3-6: execute the current action a_t in state s_t to obtain the next state s_{t+1} and the reward r_t;
step 3-7: store the quadruple {s_t, a_t, r_t, s_{t+1}} in the experience replay pool;
step 3-8: update the state: s_t = s_{t+1};
Step 3-9: sampling m samples { s } from an empirical playback pool j ,a j ,r j ,s j+1 J=1, 2, …, m, calculating the current target Q value y j :
Step 3-10: using a mean square error loss functionUpdating parameters of the predictive Q network;
step 3-11: updating parameters of the target Q network after each step C, wherein the parameter value is the parameter value of the current predicted Q network;
step 3-12: judging whether the moment reaches a set value T or not; if not, returning to the step 3-3; if so, executing the next step;
step 3-13: record E of the problem after modification i Is marked asWill->Add to learner modified problem record set +.>
Step 3-14: judging whether all the problem records of the learners are modified, if not, returning to the step 3-2, continuing the modification of the problem records of the next learner, and if so, ending the step.
7. The reinforcement learning-based personalized problem recommendation method of claim 6, wherein: the process of the joint training in step 4 is specifically as follows:
step 4-1: initialize the parameters of the personalized recommendation model α = α_0, the parameters of the knowledge tracking model β = β_0, and the parameters of the reinforcement learning module θ = θ_0;
step 4-2: train the knowledge tracking model using the learners' exercise records;
step 4-3: train the personalized recommendation model using the learners' problem records and the knowledge tracking model;
step 4-4: fix the parameters of the personalized recommendation model α = α_1 and the parameters of the knowledge tracking model β = β_1, and pre-train the reinforcement learning module; the specific method is as follows:
step 4-4-1: the reinforcement learning algorithm selects actions step by step on the problem records;
step 4-4-2: calculate the reward function Reward according to the selected actions;
step 4-4-3: update the parameters of the reinforcement learning module according to the loss function of the Deep Q-Learning algorithm;
step 4-4-4: cyclically execute steps 4-4-1 to 4-4-3 until all problem records have been processed;
step 4-4-5: repeat steps 4-4-1 to 4-4-4 until the parameters of the reinforcement learning module reach their optimal values;
step 4-5: fix the parameters of the knowledge tracking model β = β_1, and jointly train the personalized recommendation model and the reinforcement learning module; the specific method is as follows:
step 4-5-1: the reinforcement learning algorithm selects actions step by step on the problem records;
step 4-5-2: calculate the reward function Reward according to the selected actions;
step 4-5-3: update the parameters of the reinforcement learning module according to the loss function of the Deep Q-Learning algorithm;
step 4-5-4: cyclically execute steps 4-5-1 to 4-5-3 until all problem records have been processed;
step 4-5-5: update the parameters of the personalized recommendation model according to its loss function;
step 4-5-6: repeatedly execute steps 4-5-1 to 4-5-5 until the parameters of the personalized recommendation model and the reinforcement learning module reach their optimal values.
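A high-level sketch of the step 4 training schedule; every callable (train_kt, train_rec, pretrain_rl, rl_update, rec_update) is an assumed interface wrapping the corresponding model update, and the epoch counts are arbitrary placeholders:

```python
def joint_training(records, train_kt, train_rec, pretrain_rl, rl_update, rec_update,
                   rl_pretrain_epochs=10, joint_epochs=10):
    """Step 4: train the knowledge tracking and recommendation models, pre-train the RL
    module with alpha and beta fixed, then jointly train the recommender and RL module."""
    train_kt(records)                           # step 4-2
    train_rec(records)                          # step 4-3
    for _ in range(rl_pretrain_epochs):         # step 4-4: alpha, beta fixed
        for record in records:
            pretrain_rl(record)                 # select actions, compute Reward, DQN update
    for _ in range(joint_epochs):               # step 4-5: beta fixed
        for record in records:
            rl_update(record)                   # steps 4-5-1 to 4-5-3
        rec_update(records)                     # step 4-5-5: recommendation-model loss update
```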
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310703313.2A CN116680477A (en) | 2023-06-14 | 2023-06-14 | Personalized problem recommendation method based on reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116680477A true CN116680477A (en) | 2023-09-01 |
Family
ID=87787013
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310703313.2A Pending CN116680477A (en) | 2023-06-14 | 2023-06-14 | Personalized problem recommendation method based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116680477A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116720007A (en) * | 2023-08-11 | 2023-09-08 | 河北工业大学 | Online learning resource recommendation method based on multidimensional learner state and joint rewards |
CN116720007B (en) * | 2023-08-11 | 2023-11-28 | 河北工业大学 | Online learning resource recommendation method based on multidimensional learner state and joint rewards |
CN118313975A (en) * | 2024-04-18 | 2024-07-09 | 华南师范大学 | Exercise path recommending method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||