Introduction

Adaptive online textbooks are one of the oldest technologies for personalized web-based learning. By using domain and student modeling approaches developed in the field of intelligent tutoring systems (ITS), the first generation of adaptive Web-based textbooks implemented and explored a range of novel personalization approaches, such as adaptive navigation support (Brusilovsky 2007a), adaptive content presentation (Bunt et al. 2007), and adaptive content recommendation (Manouselis et al. 2013). Similar to more traditional intelligent tutoring systems, personalization in the first generation of adaptive textbooks was powered by domain and content models developed by domain experts and “overlay” student models maintained on top of the domain models (Holt et al. 1993). The domain and content modeling was typically performed by experts, who annotated textbook pages or sections with the domain concepts presented in each unit (known as outcomes) and, in some cases, with the concepts that must be known in order to understand the unit (known as prerequisites). The set of concepts used in the annotation process (frequently augmented with links) served as the domain model of a textbook. While early adaptive textbooks demonstrated solid performance (Weber and Brusilovsky 2001; Henze and Nejdl 2001; Davidovic et al. 2003; Papanikolaou et al. 2003), the expensive annotation process and the need to engage domain experts prevented the broad dissemination and use of this technology.

A noticeable shift of the publishing industry from printed textbooks to digital books and textbooks over the last ten years has encouraged a new round of attempts to build more intelligent textbooks. At the same time, recent progress in automatic keyphrase extraction (Augenstein et al. 2017) has made it possible to replace expert-driven annotation processes with automatic concept extraction from each textbook unit. Automatic techniques were also developed for recognizing prerequisite concepts (Agrawal et al. 2014; Labutov et al. 2017) and connecting domain concepts with links (Wang et al. 2015; Yang et al. 2015). It has been demonstrated that automatic concept extraction could be used to recreate several valuable technologies that originally relied on expert-produced annotation, such as external content recommendation (Kokkodis et al. 2014) and prerequisite-based linking (Agrawal et al. 2014). Yet, from the educational perspective, the current work on “smart textbooks” has several gaps. Most importantly, almost all existing extraction approaches have been neither created nor evaluated with a focus on the target context, such as structured textbooks that could support personalized learning.

We attempt to bridge some of these gaps in this work, in which we focus on the fundamental task of concept keyword extraction from structured textbooks that could support personalized learning. This paper presents a thorough and systematic analysis of supervised learning applied to the task of concept extraction from educational content. Previous work on this task has either focused on non-educational keyword extraction or has not analyzed and experimented with concept extraction in an educational context at the same coverage and depth as we do in this paper. The key contributions of this work are as follows:

  • Concept annotation: We performed a rigorous and systematic annotation of concept keywords in a technical textbook. By improving our annotation protocol over multiple iterations, we have achieved a relatively high inter-annotator agreement and have ensured that the annotated keywords closely align to the underlying concepts.

  • Concept extraction: We engineered and experimented with a highly encompassing feature set for a machine learning model that extracts the annotated concepts. Our feature set spans both linguistic features and features encoding relative corpus statistics (i.e., summarizing relative word frequencies between technical and non-technical corpora).

  • Extraction evaluation: We performed systematic ablation studies of the proposed supervised model, as well as an extensive comparative evaluation against a number of keyword-extraction models proposed in the literature.

  • Student modeling evaluation: We evaluated student modeling using concepts extracted by FACE and compared our model with other keyword extraction methods. These concepts were used as knowledge components for our student model, from which students’ overall levels of knowledge were inferred.

Jim Greer, Student Modeling, and Knowledge-Based Hypermedia

It’s rare for a journal paper to go well beyond a review of immediately related work and to discuss a broader historical and inspirational context. A Festschrift is a great opportunity to do it. While a review of related work in its usual narrow sense (i.e., adaptive textbooks and concept extraction) is provided in the next section, this section attempts to explain how adaptive textbooks came to be and how they are intertwined with related streams of research on adaptive and concept-based hypermedia. Remarkably, Jim Greer’s research was instrumental in all these stages.

An adaptive textbook is an example of knowledge-based hypermedia, which is, in turn, an active stream of research in a broader field of adaptive hypermedia (Brusilovsky 2001). While early research on adaptive hypermedia explored the ability to adapt to different information about individual users, such as navigation history (De Bra 1996), cognitive traits (Carver and Ray 1996), or goals (Höök et al. 1996), it was the ability to adapt to user knowledge that appeared most exciting and prompted the largest number of follow-up studies. It is no secret that the idea of adaptation to user knowledge and the design of this adaptation was brought to adaptive hypermedia from research on intelligent tutoring systems and student modeling. This is the part where the work of Jim Greer was highly inspirational.

Together with Gordon McCalla, Jim has been involved in several influential projects that have focused on intelligent tutoring and student modeling (Greer et al. 1989; McCalla et al. 1990; Bhuiyan et al. 1992), which have important value within the field. However, even more important in the context of this paper was Jim’s role in promoting ideas of student modeling well beyond its field of origin. From the perspective of the last author of this paper, Jim made two major contributions: his paper on his own work on student modeling, published in the User Modeling and User-Adaptive Interaction (UMUAI) journal (Huang et al. 1991), and the NATO ASI workshop “Student Modelling: The Key to Individualized Knowledge-Based Instruction”, which he co-organized with Gordon McCalla in 1991 and which was followed by a post-workshop book in 1993 (Greer and McCalla 1993). Both the paper and the book propagated the ideas and the techniques of student modeling to a much broader audience. The paper was important because it was published in the first issue of the first volume of UMUAI. This journal targeted researchers working on any kind of user-adaptive system. With this paper, the ITS approach to student modeling was firmly placed on the map of user-adaptive interaction and inspired many researchers outside of the ITS field. Similarly, the ASI workshop book brought a diverse collection of work on student modeling to a broad audience outside of the usual ITS channels. Even more importantly, it included an excellent review of student modeling principles (Holt et al. 1993), which offered a clear explanation of domain modeling and overlay student modeling, ideas that became essential for the work on knowledge-based hypermedia and adaptive textbooks. The last author of this paper (Peter Brusilovsky) was just one of many researchers who have been influenced by this work. He had a lucky chance to meet Jim Greer shortly after the ASI workshop at the AIED’91 conference in Chicago and to learn about these new developments in the area of student modeling directly from the source. The first meeting with Jim was followed by Peter’s visit to the ARIES Lab in Saskatchewan in 1992, which was another great opportunity to discuss student modeling and exchange research ideas.

Jim’s work, however, was not just an inspiration for research on knowledge-based hypermedia and adaptive textbooks. He and his students directly contributed to this research at a number of points. Most important in the context of this paper was his work on the MicroWeb toolkit (Thomson et al. 1996), one of the early examples of combining ideas of ITS and Hypertext to build a knowledge-based hypermedia on the Web. The first MicroWeb paper was published in 1996, an important year in the history of adaptive hypermedia, when the first papers about several important pioneer works on Web-based adaptive educational hypermedia were published, including those on ELM-ART (Brusilovsky et al. 1996b), 2L670 (De Bra 1996), InterBook (Brusilovsky et al. 1996c), West (Eklund and Sawers 1996), and Hypercase (Micarelli and Sciarrone 1996). Moreover, the majority of these early works, including the MicroWeb paper, were presented at the same conference, the first installment of the important WebNet series in 1996 (Thomson et al. 1996; De Bra 1996; Brusilovsky et al. 1996c; Carver and Ray 1996; Nakabayashi et al. 1996; Eklund and Sawers 1996). Until that year, each of the presenting teams had been independently working on their ideas of adaptive educational hypermedia. WebNet’96 was a great chance for all these researchers to meet, recognize common features of their research, appreciate and examine differences, and, together, come to a better conceptualization of this new field of research. The year 1996 and specifically WebNet’96 could be considered as an igniting point for the second generation of adaptive hypermedia research, which considerably expanded the mostly pre-Web and non-educational research of the first adaptive hypermedia generation (Brusilovsky 1996a). The emergence of adaptive educational hypermedia as a research field was facilitated by a series of workshops where Jim and his team were active contributors, presenting updates of their work on the adaptive hypermedia projects MicroWeb, APHID, and APHID2 (Greer and Philip 1997; Kettel et al. 2000). This work has had a significant impact on all future research on personalized Web-based educational systems, including adaptive textbooks, which are the focus of this paper. It is also vital to mention that Jim and his team were among the early contributors to the third wave of research on adaptive hypermedia focused on social navigation and other social information access technologies (Brooks et al. 2006), but discussing this development in more detail would deviate too far from this paper, which focuses on knowledge-driven, rather than socially driven, textbooks.

Adaptive Textbooks and Concept Extraction

Adaptive Textbooks

The present research on adaptive textbooks and other adaptive educational hypermedia systems has been motivated by the increasing popularity of the World Wide Web (WWW) and the opportunity to use this platform for learning. The hypertext nature of the early WWW made an online hypertext-based textbook a natural medium for learning, while the increased diversity of Web users stressed the need for adaptation. The first generation of adaptive textbooks (Brusilovsky et al. 1996c, 1998; Henze et al. 1999; Murray 2001; Weber and Brusilovsky 2001; Melis et al. 2001; Kavcic 2004) focused on tracing student reading behaviors to guide students to the most relevant pages using adaptive navigation support (Henze et al. 1999; Brusilovsky and Pesin 1998; Weber and Brusilovsky 2001) and recommendation (Kavcic 2004), and to offer students knowledge-adapted content presentation (Melis et al. 2001). These types of personalization were based on sophisticated knowledge modeling provided by domain experts. As explained in the previous section, this technology was originally developed in the field of ITS. The core of the knowledge modeling in both ITS and adaptive textbooks is a structured domain model, which is usually developed as a set of domain concepts, rules, or other knowledge components. The domain model also serves as a basis for individual student overlay models (Holt et al. 1993). To use this knowledge-based approach in adaptive textbooks, each textbook page was manually annotated with a set of concepts presented on the page as well as prerequisite concepts required to understand the page.

The combination of advanced knowledge modeling and overlay student modeling supported relatively complex personalization approaches. For example, adaptive textbooks based on prerequisite modeling were able to distinguish “ready to be learned” content, which bears new information without being too difficult for users to understand, from “not ready to be learned” content, which is too complicated for the students in their current state of knowledge. To guide readers to the most appropriate pages, a number of adaptive textbooks used the “traffic light” approach to annotate links as ready/not ready content (Brusilovsky et al. 1996b, 1998). Another popular knowledge-driven adaptive navigation support approach used “knowledge progression” icons to represent the user’s knowledge of the content of a specific section (Papanikolaou et al. 2003; Hsiao et al. 2010). Many studies confirmed the effectiveness of these approaches in the context of knowledge learning from online textbooks (Brusilovsky and Eklund 1998; Weber and Brusilovsky 2001; Henze and Nejdl 2001; Davidovic et al. 2003; Papanikolaou et al. 2003).

Despite the pedagogical success of these early adaptive textbooks, the complexity and the cost of expert-produced domain models and page-level annotations caused researchers who were interested in developing more intelligent textbooks to focus on automatic approaches that could make textbooks intelligent without engaging domain experts. Early attempts to re-implement the functionality of the first-generation expert-annotated textbooks were associated with the so-called open corpus adaptive educational hypermedia research stream (Brusilovsky and Henze 2007b). This stream explored already established approaches from the areas of information retrieval and semantic web models (Dolog and Nejdl 2003; Dicheva et al. 2009; Sosnovsky and Dicheva 2010), such as unigram-based vector models, language models, topic models, and automatic content mapping to existing ontologies, to build various kinds of models of educational documents. It has been demonstrated that these types of modeling could support certain basic functionalities originally developed in the expert-annotated adaptive textbooks, such as establishing links between textbook chapters (Guerra et al. 2013b), suggesting relevant external material (Sosnovsky et al. 2012), and performing dynamic section-level novelty assessment (Lin and Brusilovsky 2011).

More recently, however, steady progress in the area of keyphrase extraction and content modeling has encouraged several research teams to explore this concept extraction idea in a textbook context. This new stream of research uses already established concept extraction approaches and focuses on their application and augmentation rather than on improvement. In particular, serious efforts were spent on building approaches for establishing semantic and pedagogical connections between these concepts, such as concept hierarchies (Wang et al. 2015) and prerequisite structures (Chaplot et al. 2016). Several attempts have also been made to automatically separate prerequisite concepts (those that are necessary to understand a textbook unit) and outcome concepts (those that are explained within a unit) (Agrawal et al. 2014; Labutov and Lipson 2016, 2017).

Establishing prerequisite connections is an important advancement in the new generation of digital textbooks. While a flat set of concepts associated with each textbook section could be used to replicate some valuable features of adaptive textbooks, such as external content recommendation (Kokkodis et al. 2014), the most efficient approaches developed in the first generation of adaptive textbooks were based on prerequisite connections and dynamic knowledge modeling. So far, automatic concept extraction and prerequisite elicitation have already been successfully used to replicate some of these approaches. For example, the Study Navigator (Agrawal et al. 2014) replicated one of the classic prerequisite-based approaches of the first generation of adaptive textbooks (Brusilovsky and Pesin 1998) by generating concept links that connect the sections where a concept is used to the sections where it is explained.

The work presented in this paper attempts to augment the current generation of research on adaptive textbooks in two major ways. First, instead of simply re-using established extraction approaches in a textbook context, we want to improve these approaches by using certain unique features of digital textbooks. Second, we want to better leverage the results of the first generation of research on adaptive textbooks. We aim to produce a concept extraction approach that closely matches the section-level concept annotation produced by expert users and that could be used to build both domain and student models for more advanced personalization. To assess how well our approach achieves this goal, we evaluate it by both assessing how closely it approximates expert annotation and how well it supports student modeling needs.

Concept Extraction

Automatic keyphrase extraction has been extensively studied and examined using different approaches, such as rule-based learning, supervised learning, unsupervised learning, or deep neural networks.

Automatic keyphrase extraction systems typically consist of two parts (Augenstein et al. 2017): (1) preprocessing data and extracting a list of candidate keyphrases using lexical patterns and heuristics; and then (2) determining which of these candidates are correct keyphrases based on some ranking scores.

The goal of extracting the candidate keyphrase list is to obtain all of the potential candidates while keeping the number of candidates as small as possible. Several studies extract candidates from words with certain part-of-speech (POS) tags (e.g., nouns or noun-noun sequences) (Mihalcea and Tarau 2004; Bougouin et al. 2013; Liu et al. 2009a; Wan and Xiao 2008). Others extract n-grams with simple filtering rules (Witten et al. 1999; Medelyan et al. 2009) or allow only those that match the titles of Wikipedia articles (Wang et al. 2015; Grineva et al. 2009). More complex approaches extract noun phrases and apply predefined lexico-syntactic patterns (Florescu and Caragea 2017; Le et al. 2016).

The next step is to score each candidate based on certain properties that indicate how likely a candidate keyphrase is to be a concept in the given document. Machine learning approaches to this task can be grouped into two categories: supervised and unsupervised. Among unsupervised learning approaches, graph-based approaches (Mihalcea and Tarau 2004; Bougouin et al. 2013) consider a candidate keyphrase to be important if it is related to a large number of candidates and if those candidates are also important in the document. Candidates and the relations between them form a graph for the input document. A graph-based ranking algorithm (e.g., PageRank) is applied to give a score to each node. Finally, the top-ranked candidates are selected as keyphrases for the input document. Unsupervised topic-based clustering methods (Liu et al. 2009b, 2010; Grineva et al. 2009) attempt to group semantically similar candidates in a document as topics. Keyphrases are then selected based on the centroid of each cluster or the importance of each topic.
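
To make the graph-based ranking idea concrete, the following is a minimal sketch rather than a reimplementation of any of the cited systems. It assumes single-word candidates, links two candidates when they co-occur within a small token window, and runs a plain PageRank iteration; the function name `rank_candidates` and the parameters `window`, `damping`, and `iterations` are our own illustrative choices.

```python
from collections import defaultdict


def rank_candidates(tokens, candidates, window=2, damping=0.85, iterations=50):
    """Rank single-word candidates by PageRank over a co-occurrence graph."""
    cand = set(candidates)
    # Undirected graph: two candidates are linked if one follows the other
    # within `window` tokens (an assumed notion of "related").
    neighbors = defaultdict(set)
    for i, tok in enumerate(tokens):
        if tok not in cand:
            continue
        for other in tokens[i + 1 : i + 1 + window]:
            if other in cand and other != tok:
                neighbors[tok].add(other)
                neighbors[other].add(tok)

    nodes = sorted(cand)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_score = {}
        for n in nodes:
            # Each neighbor passes on an equal share of its current score.
            incoming = sum(score[m] / len(neighbors[m]) for m in neighbors[n])
            new_score[n] = (1 - damping) / len(nodes) + damping * incoming
        score = new_score
    # Highest-scoring candidates first.
    return sorted(nodes, key=lambda n: score[n], reverse=True)


tokens = "index term index posting index query".split()
ranking = rank_candidates(tokens, ["index", "term", "posting", "query"], window=1)
```

In this toy input, "index" co-occurs with every other candidate and is therefore ranked first; a real system would operate on multi-word candidates and richer relations.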

The supervised learning approaches typically frame this task as a binary classification problem (Witten et al. 1999; Hulth 2003; Jiang et al. 2009). A variety of features have been used for training supervised models, including statistics-based features, title-based features, linguistics-based features, and features derived from external resources (Hammouda et al. 2005; Witten et al. 1999; Rose et al. 2010; Hulth 2003; Wang et al. 2015; Yih et al. 2006; Nguyen et al. 2007).

Deep learning approaches, which share features of both supervised and unsupervised learning, have been successfully applied to many NLP-related tasks, including named entity recognition (NER) and sequence tagging. However, few studies have focused on the keyphrase extraction problem. Meng et al. (2017) built a deep keyphrase generation model with an encoder-decoder framework. They applied an RNN-based generative model to predict keyphrases. Another study, using deep sequence labeling with Bi-LSTM-CRF models, has been shown to outperform its unsupervised and supervised baselines (Alzaidy et al. 2019). However, deep learning models require a large amount of data to achieve their best performance, as compared with traditional machine learning approaches.

While many general keyphrase-extraction approaches exist, few have focused on the educational domain and almost none have considered a textbook corpus. There are a number of projects that apply book concepts to achieve a specific target; for example, building concept hierarchies for textbooks (Wang et al. 2015) or separating prerequisite and outcome concepts (Labutov et al. 2017). However, they did not focus on advanced concept extraction and instead used existing data (Labutov et al. 2017) or lightweight extraction approaches (Wang et al. 2015).

A related line of work has focused on building educational ontologies from texts. Manual construction of ontologies is an extremely time-consuming and costly process (Shamsfard and Barforoush 2004; Wong et al. 2012). Automatically constructing an ontology is a complicated task that requires advanced technology in related areas, such as natural language processing or text mining. It requires the recognition of not only concepts described in texts, but also the relationships between them. Attempts to build ontologies from texts usually use existing technologies, such as NLP tools or simple heuristic rules, to extract ontological concepts; for example, Shamsfard and Barforoush (2004) use a simple morphological and syntactic analysis to extract primary concepts in Persian texts; Zouaq et al. (2007) use a Stanford parser and KEA, a simple keyphrase extraction method; Wong et al. (2012) summarize a list of studies using different strategies (e.g., statistics-based, linguistics-based or logic-based); Conde et al. (2014) consider index items from a book as domain topics; and litewi (Conde et al. 2016) combines several unsupervised term extraction approaches and uses Wikipedia to provide additional information. However, to the best of our knowledge, the performance of existing automatic term and concept extraction methods remains underwhelming, especially in educational domains. Improving automatic keyphrase extraction is not only useful for immediate downstream tasks such as student modeling and content recommendation, but is also a step forward in accomplishing the task of automatic educational-ontology construction.

The work presented in this paper applies state-of-the-art extraction approaches to the under-explored textbook context. We use a supervised method for concept extraction from textbooks with an extensive list of carefully selected features. We evaluate the approach on a brand new dataset and compare it with several state-of-the-art baselines, and have made both the code and data available on GitHub (see Footnote 1).

The Dataset

One of the challenges for keyphrase extraction is to obtain a good dataset for training and testing the models; there are comparatively few datasets with labeled data for educational resources, such as textbooks, course descriptions, and slides. An added challenge for the educational context is its focus on knowledge transfer. As a result, educational applications usually refer to concepts associated with text, rather than keywords or keyphrases.

In this context, we define domain concepts as keyphrases (single words or short phrases of two to four words) that represent the most essential knowledge elements presented in a text fragment (e.g., a sentence, a paragraph, a section) with respect to its target domain (e.g., computer science (CS)) or a related domain (e.g. statistics). Those concepts should have specific meanings in the CS domain and be important in the information retrieval (IR) sub-domain, but may have different meanings in other domains. Without understanding the conceptual meaning, readers could not understand the content. For example, consider the sentences/paragraphs below:

“Tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation”. In this example, Tokenization and tokens are domain concepts, but characters and punctuation are not.

To support our work on automatic concept extraction, we built a dataset with a section-level concept index for the first 16 chapters of the book “Introduction to Information Retrieval” (IIR) (Wang et al. 2020). For each section (the lowest-level unit in the Table Of Contents) of the textbook, the dataset provides a list of essential concepts mentioned in that section. The statistics of the dataset are shown in Table 1.

Table 1 Statistics of the IIR dataset

To build this dataset, we engaged three paid experts: one PhD student working in the IR domain and two Masters students who completed the IR course with high scores. Before the start of the process, the annotators received training and passed a test that focused on the understanding of the task, the “codebook” of annotation rules, and the annotation interface. Every week, the three experts focused on completing annotations for one chapter (i.e., all sections that belonged to the chapter). After finishing an annotation session, they discussed the cases in which their annotations disagreed, made the final decision for the concept list, and, if necessary, added new “codebook” rules to help increase the agreement in the future. Throughout this process, the inter-annotator proportion agreement among the three annotators before discussion gradually increased from 0.25 to 0.68 at week 3 and to 0.9 at the end of the whole annotation process (see Fig. 1).

Fig. 1 Inter-annotator proportion agreement results (week by week). The average agreements are the proportion agreements among three annotators. Average pair agreements are the average proportion agreements of three annotator pairs

Automatic Concept Extraction

It is important to stress that we attempt to focus on extracting true concepts from text; i.e., not any kind of keyphrases, but those that correspond to key domain concepts mentioned in the text. We achieve this goal by using a large number of features that have a good chance of capturing the specific linguistic, positional, or statistical ways in which concepts are expressed in text and by training our models on concept labels that are generated by domain experts. With this approach, we expect the model to learn the features that are specific to concept appearance in the text, which are considerably different from those of traditional term labels.

The Task Formulation

We formulated the concept extraction task in the following way: given a textbook that has multiple chapters where each chapter includes several sections, the system extracts a list of concepts appearing in each of the sections.

The concept extraction task is similar to the tasks of keyphrase extraction and named entity recognition. However, it is more challenging, because (1) concepts vary significantly across domains; (2) it is hard to define the boundary between the domains; and (3) there is a lack of clear signifiers and context. In order to perform this task, we recast it as a binary classification problem for a list of extracted candidates. We train a supervised learning model to classify each term or phrase candidate as either a concept or not. The details of our framework are described in the next section.

The Framework

Preprocessing:

We preprocessed the textbook to extract the section names, titles, and text content of each section.

Data preparation:

This step includes noun phrase chunking and filtering to extract terms from texts. We used Stanford’s POS tagger (Toutanova et al. 2003; see Footnote 2) to annotate each word in the text with its linguistic part of speech. Given our definition that a concept is a noun or noun phrase, we applied linguistic rules (e.g., ‘noun + noun’ or ‘adjective + noun’) using regular expressions to extract all possible nouns and noun phrases in the text. We only considered unigrams, bigrams, trigrams, and four-grams, which account for 99.42% of all the unique concepts (shown in Table 1).

After extracting all noun phrases, we used a stop-list to filter non-descriptive words (mostly determiners) that add no additional meaning to the concept (e.g., such, same, many, little, few, or certain). For instance, though “many searching algorithms” and “searching algorithms” are both noun phrases extracted from the text, it is easy to recognize that “many searching algorithms” should not be considered as a concept.

Let’s look at the example below:

“The general strategy for determining a stop list is to sort the terms by collection frequency”.

After tagging: “The_DT general_JJ strategy_NN for_IN determining_VBG a_DT stop_NN list_NN is_VBZ to_TO sort_VB the_DT terms_NNS by_IN collection_NN frequency_NN ._.”.

Final candidate list: {general strategy, strategy, stop list, stop, list, terms, collection frequency, collection, frequency}
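
The data preparation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact rule set used in our system: we assume a single pattern (zero or more adjectives followed by one or more nouns), apply it to every n-gram of up to four tokens, and drop n-grams that contain a stop-list word. The names `extract_candidates`, `PATTERN`, and `STOP_WORDS` are hypothetical, and the stop-list contains only the example words mentioned above.

```python
import re

# Example stop-list words from the text; the full list is not reproduced here.
STOP_WORDS = {"such", "same", "many", "little", "few", "certain"}
# Assumed rule: optional adjectives (J) followed by at least one noun (N).
PATTERN = re.compile(r"J*N+")


def extract_candidates(tagged_sentence, max_len=4):
    """Return noun and noun-phrase candidates from a 'word_TAG' tagged sentence."""
    pairs = [tok.rsplit("_", 1) for tok in tagged_sentence.split()]
    words = [word.lower() for word, _ in pairs]
    # Collapse Penn Treebank tags into one letter per token.
    codes = "".join(
        "J" if tag.startswith("JJ") else "N" if tag.startswith("NN") else "O"
        for _, tag in pairs
    )
    candidates = []
    for start in range(len(words)):
        for n in range(1, max_len + 1):
            end = start + n
            if end > len(words):
                break
            if not PATTERN.fullmatch(codes[start:end]):
                continue
            phrase_words = words[start:end]
            if any(w in STOP_WORDS for w in phrase_words):
                continue
            phrase = " ".join(phrase_words)
            if phrase not in candidates:
                candidates.append(phrase)
    return candidates


TAGGED = (
    "The_DT general_JJ strategy_NN for_IN determining_VBG a_DT stop_NN list_NN "
    "is_VBZ to_TO sort_VB the_DT terms_NNS by_IN collection_NN frequency_NN ._."
)
candidates = extract_candidates(TAGGED)
```

Under these assumptions, the sketch yields the same nine candidates as the final candidate list above; a standalone adjective such as "general" is rejected because the pattern requires at least one noun.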

Feature extraction:

After obtaining the final candidate list, we extracted all features for each of the candidates. The feature set includes linguistic features (e.g., POS, two tokens before, two tokens after), statistics features (e.g., term frequency, tf-idf), its match to external resources (i.e., Wikipedia titles and ACM keyword repository) and its presence in the section title. The details of the feature set are described in the next section.

Model training for concept classification:

We trained a logistic regression model on the feature vectors of the candidates to classify which of the terms or phrases extracted at the data preparation step are concepts. All non-binary features in our model are binned and discretized as binary features. In this way, our logistic model is capable of learning non-linear relationships with those features. For the purposes of cross-validation, we split the data into 5 folds; in each fold, 80% of the data is used for training and 20% for testing. Because the same phrase can yield multiple candidates (e.g., ‘postings list’ appears in multiple sections of the book), we forced all candidates from the same phrase to be present only in the training set or only in the test set when splitting the data.
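
The phrase-level constraint on the split can be illustrated with a short sketch. The code below is not our actual splitting routine; it assumes a simple greedy assignment of whole phrase groups to folds, so fold sizes only approximate the 80%/20% proportion. The function name `grouped_folds` and the `(phrase, section_id)` candidate representation are hypothetical.

```python
from collections import defaultdict


def grouped_folds(candidates, n_folds=5):
    """Yield (train_indices, test_indices) so that no phrase spans both sides.

    `candidates` is a list of (phrase, section_id) pairs; the same phrase may
    occur in several sections.
    """
    by_phrase = defaultdict(list)
    for idx, (phrase, _section) in enumerate(candidates):
        by_phrase[phrase].append(idx)

    # Greedy balancing: place the largest phrase groups first, each into the
    # fold that currently holds the fewest candidates.
    folds = [[] for _ in range(n_folds)]
    for phrase in sorted(by_phrase, key=lambda p: (-len(by_phrase[p]), p)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_phrase[phrase])

    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for f in range(n_folds) if f != k for i in folds[f])
        yield train, test


data = [
    ("postings list", "1.1"), ("postings list", "2.3"), ("postings list", "5.1"),
    ("token", "2.2"), ("token", "2.4"), ("stop list", "2.2"),
    ("tf-idf", "6.2"), ("index", "1.1"), ("query", "1.1"), ("term", "1.2"),
]
splits = list(grouped_folds(data))
```

Each test fold is used exactly once, and every occurrence of a phrase such as ‘postings list’ lands on the same side of a given split.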

Features

To train our concept extractor, we used 25 types of features listed in Table 2. In total, we obtained 7661 features for this specific dataset. We categorized the features into four subsets: linguistic features, statistics-based features, features that use external resources, and features that use the section title. Each subset represents different identifiers and cues that could help recognize concepts.

Table 2 Features used in our concept extractor

Linguistic features

Linguistic features provide the most informative and significant cues to identify concepts. These features capture both internal (i.e., constituent words) and external (i.e., contextual) characteristics of the concept candidate.

  • POS (features 1-5): encode the part-of-speech structure of the candidate. This set of features helps to identify common patterns that concepts may have (e.g., noun + noun, or adjective + noun). In addition, we included separate POS features for specific tokens of the candidate, which could provide more fine-grained patterns for the extractor.

  • Context (features 6-17): describe the surrounding context of the candidate (e.g., the first word to the left of the candidate and the POS of that word).

  • Length of candidate: number of tokens the candidate has. As Table 1 shows, the distribution of different n-grams varies significantly.
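To make the linguistic features concrete, the sketch below builds sparse binary features for one candidate from a POS-tagged sentence. It is an illustrative assumption, not the authors' feature extractor: the function name `candidate_features`, the feature-name strings, and the Penn Treebank style tags in the test data are hypothetical, and the window of two tokens on each side follows the description of the context features given earlier.

```python
def candidate_features(tokens, pos_tags, start, end, window=2):
    """Binary linguistic features for the candidate tokens[start:end]."""
    features = {
        # Internal structure: POS pattern of the candidate and its length.
        "pos_pattern=" + "+".join(pos_tags[start:end]): 1,
        "length=%d" % (end - start): 1,
    }
    # External context: words and POS tags up to `window` positions away.
    for offset in range(1, window + 1):
        left = start - offset
        right = end - 1 + offset
        if left >= 0:
            features["left%d_word=%s" % (offset, tokens[left].lower())] = 1
            features["left%d_pos=%s" % (offset, pos_tags[left])] = 1
        if right < len(tokens):
            features["right%d_word=%s" % (offset, tokens[right].lower())] = 1
            features["right%d_pos=%s" % (offset, pos_tags[right])] = 1
    return features
```

Encoding each word and tag as its own binary feature is one reason the total feature count can grow into the thousands for a single textbook.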

Statistical features

In this section, we present several statistics-based features, which are inspired by work in information retrieval. These methods (also known as term-scoring methods) give a specific value to a candidate term, based on how it is distributed in the textbook. The central component of term scoring is term frequency.

  • Frequency (fre): how many times a candidate term occurs in a particular section. We created binary features where the frequency is less than or equal to 1, 2, 3, 4, 5, or 6. The intuition is that if a candidate term appears many times in a section, it may be a less informative and more generic term.

  • Collection frequency (cf): how many times a candidate term occurs in the entire textbook. We also created a set of cf-related binary features, where the frequency is considered up to a heuristic threshold of 50.

  • Term frequency-inverse document frequency (tf-idf): idf is a measure of the informativeness of the candidate term. A set of binary features was created for the log of the tf-idf score (at various thresholds).

  • Language model (lang): this feature is computed from the probability distributions of a foreground corpus (i.e., information retrieval) and a background corpus (i.e., a large corpus that encodes knowledge about the world). We use the content of the textbook as the foreground corpus and calculate the distribution for each of the candidates. For the background, we obtained the distribution of n-grams from the Bing Web Language Model API. We hypothesize that a candidate is more likely to be a concept if its probability in the foreground corpus is significantly higher than in the background corpus.
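The statistical features above can be illustrated with a short sketch. The helper names are hypothetical, and two details are assumptions because the text does not specify them: tf-idf is computed with the common idf = log(N / df) form, and the foreground/background comparison is expressed as a log ratio of probabilities.

```python
import math


def threshold_features(name, value, thresholds):
    """One binary feature per threshold: 1 if value <= threshold, else 0."""
    return {"%s<=%s" % (name, t): int(value <= t) for t in thresholds}


def tf_idf(term_freq, n_sections, sections_with_term):
    """tf-idf with idf = log(N / df) (assumed variant)."""
    return term_freq * math.log(n_sections / sections_with_term)


def foreground_background_ratio(p_foreground, p_background):
    """Log ratio of a candidate's probability in the textbook vs. a general corpus.

    Positive values mean the candidate is more typical of the textbook.
    """
    return math.log(p_foreground / p_background)
```

For example, `threshold_features("fre", 3, range(1, 7))` turns a section frequency of 3 into six binary indicators, of which those for thresholds 3 to 6 are set. The same helper could be applied to cf with thresholds up to 50, although the exact cut points used there are not given in the text.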

External resources

These features attempt to improve the performance of the model by exploiting existing lexical knowledge bases, which are usually built by domain experts. These resources are independent of the training data, so the features can be computed directly without the need to label any training data. In this work, we leverage the following resources:

  • Wikipedia: We collected all IR-related Wikipedia article titles, based on the observation that a candidate is likely to be a concept if there is an article discussing it or some of its aspects. This collection is used to check if a candidate appears in any of these article titles. This feature is called Wikipedia title-based feature (or wTitle for short).

  • ACM Computer Science keyphrase repository: We assume that if a candidate appears in the collection of keywords in the computer science domain published by ACM, it is very likely to be a concept.

Section titles

Book authors use section titles to inform readers of the topics, ideas, or problems that they are going to present. It is intuitive to assume that if a candidate appears in a section title, it should have a significant meaning that contributes to the topic. Therefore, we added one more feature to the model, called section title-based feature (or sTitle for short). Along with the fre and cf features, sTitle allowed us to leverage the structured nature of a textbook as a tree of units with highly descriptive titles.
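The external-resource and section-title features reduce to simple lookups, which can be sketched as below. This is an illustration under assumptions: matching is done on whitespace-separated, lowercased tokens, the feature key `acm` is a hypothetical name (only wTitle and sTitle are named in the text), and the example titles are invented for the demonstration.

```python
def contains_phrase(text, phrase):
    """True if the tokens of `phrase` occur contiguously in `text`."""
    text_tokens = text.lower().split()
    phrase_tokens = phrase.lower().split()
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(text_tokens[i:i + n] == phrase_tokens
               for i in range(len(text_tokens) - n + 1))


def lookup_features(candidate, wiki_titles, acm_keywords, section_title):
    """Binary lookup features for one candidate in one section."""
    return {
        # Candidate appears in some Wikipedia article title.
        "wTitle": int(any(contains_phrase(t, candidate) for t in wiki_titles)),
        # Candidate appears in some ACM keyword (hypothetical feature name).
        "acm": int(any(contains_phrase(k, candidate) for k in acm_keywords)),
        # Candidate appears in the title of the section it was found in.
        "sTitle": int(contains_phrase(section_title, candidate)),
    }
```

A real pipeline would probably also normalize punctuation and inflection before matching; that step is omitted here.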

Static Evaluation: The Closeness of FACE to Expert Annotations

The Evaluation Approach

To evaluate our model and compare it with the baselines, we used several metrics: AUC, micro precision, micro recall, micro F1, macro precision, macro recall and macro F1. We computed the scores using exact matching. While we are aware of the limitation of exact matching for keyphrase extraction evaluation, this method remains the best solution for comparing models’ performance without humans in the loop.
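The difference between the micro and macro scores under exact matching can be shown with a small sketch. It assumes that the macro scores average per-section scores, which the text does not state explicitly; the function names are hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def evaluate(gold_by_section, pred_by_section):
    """Exact-match micro and macro (per-section average) P/R/F1."""
    totals = [0, 0, 0]
    per_section = []
    for section, gold in gold_by_section.items():
        pred = pred_by_section.get(section, set())
        counts = (len(gold & pred), len(pred - gold), len(gold - pred))
        per_section.append(prf(*counts))
        totals = [t + c for t, c in zip(totals, counts)]
    # Micro: pool all counts first, then compute the scores once.
    micro = prf(*totals)
    # Macro: compute the scores per section, then average them.
    macro = tuple(sum(s[i] for s in per_section) / len(per_section)
                  for i in range(3))
    return micro, macro
```

Micro scores are dominated by sections with many concepts, while macro scores weight every section equally, which is why the two can rank methods differently.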

Baselines

We compared our model with the following baselines.

  1. Random model: The random model mimics the process of building a logistic regression model without training any model; it randomly assigns probabilities from 0 to 1 to candidates and uses a cutoff of 0.35 (i.e., the same as the main model) to classify concepts.

  2. Linguistics model: The logistic regression model, which uses only the linguistic features (i.e., features 1-18), also used in Yih et al. (2006) and Nguyen et al. (2007).

  3. Statistics model: The logistic regression model, which uses only the statistics-based features (i.e., features 19-22), also used in Hammouda et al. (2005), Witten et al. (1999), Rose et al. (2010), Nguyen et al. (2007), and Yih et al. (2006).

  4. External resource baseline: The logistic regression model, which uses only the external resource-based features (i.e., features 23-24), also used in Wang et al. (2015).

  5. Title baseline: The logistic regression model, which uses only the title-based features (i.e., 24), also used in Wang et al. (2015) and Yih et al. (2006) for extracting concepts and Labutov et al. (2017) for predicting prerequisite and outcome concepts.

  6. TextRank baseline: a well-known graph-based approach for keyphrase extraction (Mihalcea and Tarau 2004).

  7. TopicRank baseline: a graph-based ranking method to discover topical representations for documents from which keyphrases are generated (Bougouin et al. 2013).

  8. Rapid automatic keyword extraction (RAKE) baseline: an unsupervised, domain-independent and language-independent approach for extracting keywords from individual documents (Rose et al. 2010).

  9. CopyRNN baseline: an RNN-based model using an encoder-decoder architecture to predict keyphrases (Meng et al. 2017).

  10. Humans/AMT baseline: We recruited three annotators from Amazon Mechanical Turk. The annotators were assigned chapters 6 and 8 of the IIR book, comprising 13 sections (we chose these two chapters because they have a reasonable amount of text for the annotation assignments).

Results

With this set of features, the logistic regression model with a 5-fold cross validation achieved an AUC score of 0.94 and a micro F1 score of 0.76 (see Table 3) for the concept classification task. This performance is significantly better than all of the partial models (i.e., those with different subsets of features). Among the partial models, the linguistic model performs the best. This means that for the task of concept classification from textbooks, language-based features, which take advantage of the syntactic structure and the context of candidates, are the most important signifiers. The statistics model also achieved a significant result.

Table 3 AUC, micro F1 and macro F1 of our model compared to the baselines. Significance testing only performed on the random and the partial models

Table 3 shows that our model outperforms all the baselines. Although RAKE achieved the highest macro recall (63%), its precision was the lowest, which resulted in a lower F1 score.

Again, the linguistic model is the best among the partial models, achieving an F1 of 0.51. It also performed significantly better than the other baselines, including the human baseline (i.e., Amazon Mechanical Turk).

CopyRNN, a deep neural net-based model, did not perform as well for this task as expected. This is likely due to the fact that we directly used the original model, which was trained on a different dataset (the paper abstract-keyword datasets), to predict concepts in the textbook. Our model can achieve a precision as high as 98% (at 19% recall) or a recall as high as 97% (at 37% precision), depending on the preference of the user (see Fig. 2). Since concept extraction still remains a challenging task, it is difficult to simultaneously accomplish a high recall and a high precision. The ability to choose between high recall and high precision could help to improve downstream tasks, depending on which is more important. In the document linking task, for instance, we may want to achieve a high recall (i.e., identify many concepts) to distinguish documents. On the other hand, a high precision could result in better performance for student modeling and prediction tasks, which require more precise and more accurate concepts.
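The precision/recall trade-off discussed above comes from moving the probability cutoff of the classifier. The sketch below shows one way a user might pick a cutoff for a precision target; the function names and the toy scores are hypothetical and do not reproduce the reported numbers.

```python
def precision_recall_at(scores, labels, cutoff):
    """Precision and recall when candidates with score >= cutoff are accepted."""
    predicted = [s >= cutoff for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    # With no accepted candidates, precision is reported as 1.0 by convention.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def lowest_cutoff_with_precision(scores, labels, min_precision):
    """Smallest observed cutoff that meets the precision target, or None.

    Recall never increases as the cutoff rises, so the smallest qualifying
    cutoff also gives the highest recall among the qualifying cutoffs.
    """
    for cutoff in sorted(set(scores)):
        precision, recall = precision_recall_at(scores, labels, cutoff)
        if precision >= min_precision:
            return cutoff, precision, recall
    return None
```

A recall-oriented task such as document linking would use a low cutoff, whereas a precision-oriented task such as student modeling would raise it.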

Fig. 2 ROC curve from the main concept classifier

Error Analysis

Errors propagated to the final prediction stage from multiple sources. Some came from the preprocessing step due to noisy text, while others were from the model itself.

After careful data preprocessing and preparation, we were able to obtain 97.72% of all the expert-annotated concepts; most of the missing concepts came from special characters (e.g., (pseudo-)relevance feedback) or errors of POS tagging.

As Table 4 shows, the model failed to identify most of the 4-gram concepts. 57% of unigrams were not recognized, which accounted for more than half of the false negative cases. On the other hand, there were predicted concepts from the model that could be considered as concepts but were not annotated by the experts; for example, optimization, Bayesian network, frequency-based feature selection, multinomial unigram language model. Some of the errors came from partial matching; for instance, maximum likelihood estimates is an actual concept, and the model predicts likelihood estimates as a concept.

Table 4 Concepts annotated by experts, but not predicted by the model (false negative)

For the candidates predicted by the model but not annotated by the experts (i.e., false positives), we asked an expert to additionally evaluate them. Based on the expert’s judgement, 13% of the unigrams, 30% of the bigrams, and 30% of the trigrams among them could be considered concepts. Those cases come either from the experts missing them during the annotation process or from partial matching.

For both the false negative and the false positive cases, we can see that unigram candidates and concepts contribute to most of the failed cases, meaning that it is harder to deal with unigram concepts than with bigrams or trigrams. Moreover, as Table 4 shows, only 23% of the actual bigram concepts are not identified by the model. Although bigram concepts account for 56.45% of all the concepts, they are much easier to recognize.

Dynamic Evaluation: How Well FACE Works for Student Modeling

We have shown that FACE outperforms several state-of-the-art concept extraction methods and comes closest to the human expert annotations. In this section, we present how FACE performs in the context of student modeling. We also compare FACE with several baselines mentioned in the previous “Baselines” section.

Student models are used to track student learning in online-learning platforms like massive open online courses (MOOCs) and intelligent tutoring systems (Corbett and Anderson 1995; Pavlik et al. 2009). These models are maintained by observing students working with learning materials and are used to adapt system behaviors to individual students; e.g., to recommend the most relevant materials or practice activities. Student models rely on expert-annotated knowledge components (also known as knowledge units, concepts, or skills) to measure overall student performance.

Knowledge Components for Student Modeling

Modern student models are able to maintain the level of student knowledge for a set of knowledge components (KCs). KCs are the fundamental units on which overall student knowledge is measured. For example, a student practicing elementary mathematics problems might have to understand KCs like “Addition”, “Subtraction”, “Multiplication” and “Division”. Traditionally, experts annotate practice activities or learning resources with KCs. Recent research on intelligent textbooks shows that automated concept extraction can be used to extract KCs (Thaker et al. 2018, 2019a) and apply them to student modeling. To evaluate and understand the quality of the concepts extracted by FACE, we used them as KCs for student models and measured the predictive power of the obtained student model. In the following subsections, we will discuss the system used, the data collection procedure, and the details of the experiment.

System and Dataset

The dataset used for this experiment was collected from Reading Circle (Guerra et al. 2013a), an online reading platform. This system was used in a graduate Information Retrieval course, which used the Introduction to Information Retrieval book studied in the previous sections. The system provides the student with an active reading environment, where they read the assigned textbook sections to prepare for the next class. Each section of the textbook is followed by a quiz, which allows students to assess how well they learned the content. There is no restriction on the number of attempts to answer the questions. Reading Circle logs all attempts made by the students. The dataset contains students’ time spent on reading sections and quiz performances. The dataset includes interactions from 22 students collected from the Spring 2016 semester. Thus, with this dataset and our FACE method, we have student performances on each activity, as well as concepts annotated with each activity. Details of the dataset are listed in Table 5.

Table 5 SM dataset statistics

Student Modeling Method

To assess the quality of each concept extraction method, we used the concepts extracted from each section and quiz as KCs to model students’ reading and quiz attempt behavior and attempted to predict their future performances. To perform this analysis, we used a comprehensive factor analysis model (CFM) (Thaker et al. 2019b). CFM is a logistic regression-based model that takes students’ previous performances and reading behaviors as input to predict their success rate for a given question. We selected CFM to model student performance as it performs better on intelligent textbooks than other state-of-the-art student modeling approaches and also incorporates student reading behavior, which has proven to be beneficial in online textbook-based learning systems (Huang et al. 2016; Thaker et al. 2018).

To consider students’ reading behaviors, CFM uses students’ reading opportunity on KCs as input. The reading opportunity parameter assumes that student mastery of a knowledge unit improves with the opportunities the student has to read materials associated with the KCs. One reading opportunity is the duration for which a student has the text page opened. Thus, the reading opportunity starts when the student visits a particular page, and it ends when the student starts performing practice activities on that page or leaves the page to visit another page. For more details, refer to CFM (Thaker et al. 2019b).
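The definition of a reading opportunity given above can be turned into a small log-processing sketch. The event format (time-ordered tuples of timestamp in seconds, action, and page for a single student) and the action names `open` and `quiz` are assumptions made for illustration; they are not the actual Reading Circle log schema, and the CFM model itself is not reproduced here.

```python
def reading_durations(events):
    """Return (page, seconds) reading opportunities for one student's events.

    `events` is a time-ordered list of (timestamp, action, page) tuples,
    where action is "open" (page visited) or "quiz" (practice attempt).
    """
    durations = []
    current_page, started = None, None
    for timestamp, action, page in events:
        if current_page is not None:
            # A quiz attempt or a visit to another page ends the opportunity.
            durations.append((current_page, timestamp - started))
            current_page, started = None, None
        if action == "open":
            # Visiting a page starts a new reading opportunity.
            current_page, started = page, timestamp
    # A page that is still open at the end of the log has no closing
    # event, so it is left out in this sketch.
    return durations
```

Each resulting duration could then be credited to the KCs extracted from the corresponding page before it is passed to the student model.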

Evaluation method

To evaluate the performance of CFM on student performance, we performed 5-fold cross-validation with student-stratified folds. First, we randomly selected 80% of the students and put all their reading and quiz activity data into the training set. Then, we put all the reading and quiz activity data of the remaining 20% of students into the test set.

Predictions are reported on quiz performance. The 5-fold cross-validation is performed on the generated folds, and the Area Under the Receiver Operating Characteristic curve (AUC), Root Mean Squared Error (RMSE), and F-score (F1) are reported. Larger AUC and F1 values and lower RMSE values indicate better results. We also ran all the concept extraction baselines to generate baselines for the student modeling evaluation.
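For reference, the two probability-based metrics can be computed as in the following sketch, where `y_true` holds the observed quiz outcomes (1 for correct, 0 for incorrect) and `y_prob` the predicted success probabilities. The function names are hypothetical, and the pairwise AUC formulation is chosen for clarity rather than speed.

```python
import math


def rmse(y_true, y_prob):
    """Root mean squared error between outcomes and predicted probabilities."""
    squared = [(p - y) ** 2 for y, p in zip(y_true, y_prob)]
    return math.sqrt(sum(squared) / len(squared))


def auc(y_true, y_prob):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking, which gives a reference point for the student modeling results.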

Evaluation Results and Discussion

In this section, we report our results in comparison with the baselines. This helps us to understand the quality of the concepts extracted by FACE relative to the baseline methods. As Table 6 shows, we compared FACE with several of the automated extraction baselines presented in the previous section. The results show that the CopyRNN method outperformed all of the other methods listed in the table.

As the CopyRNN baseline outperformed even the human expert annotations, we became skeptical about the results and tried to understand the differences between CopyRNN and the other approaches. One difference we noticed is that CopyRNN never repeats a keyphrase and always selects the whole keyphrase, while the other methods extract entire keyphrases as well as keyphrases that are part of a larger keyphrase. This redundant information may cause errors in the process of modeling student knowledge. For example, “Probabilistic Graphical Model” as well as “Graphical Models” will both be extracted by the FACE method. To solve this, we added a filter step to all the methods, which selected only whole keyphrases and removed all sub-keyphrases. One explanation for this can be that a keyphrase (K1) can act as a prerequisite to the text, given that there is already a keyphrase (K2) that has all the words in K1.

After this filtering step, we repeated the student modeling experiment; the results are shown in Table 7. Table 7 shows that the human expert annotation method was the best-performing method. An AUC of 70% with human annotation indicates that using concepts as KCs is a difficult problem. The results show that FACE is better at extracting KCs than the other unsupervised (TextRank, RAKE, TopicRank) and supervised (CopyRNN) methods and is the closest to the human expert annotation baseline. This is further evidence of the effectiveness of using FACE for keyphrase extraction.
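The filter step that keeps only whole keyphrases can be sketched as follows. This is a minimal illustration under assumptions: a sub-keyphrase is taken to be a strictly shorter, contiguous, lowercased token sequence of another extracted keyphrase, the function names are hypothetical, and no stemming is applied (so a plural such as “graphical models” would not match “probabilistic graphical model” without additional normalization).

```python
def is_subphrase(short, long):
    """True if `short` is a strictly shorter contiguous token run of `long`."""
    short_tokens = short.lower().split()
    long_tokens = long.lower().split()
    n = len(short_tokens)
    return n < len(long_tokens) and any(
        long_tokens[i:i + n] == short_tokens
        for i in range(len(long_tokens) - n + 1)
    )


def keep_whole_keyphrases(keyphrases):
    """Drop duplicates and any keyphrase contained in a longer keyphrase."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(k.lower() for k in keyphrases))
    return [k for k in unique
            if not any(is_subphrase(k, other) for other in unique)]
```

The filter would presumably be applied to the keyphrases of each section or quiz separately, so that a short concept is only removed where a longer concept containing it is also present.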

Table 6 AUC, RMSE, and F1 of FACE, as compared to the baselines
Table 7 AUC, RMSE and F1 of FACE compared to the baselines

Conclusions and Future Work

In this paper, we present FACE, a supervised machine learning model with a list of rich, carefully hand-crafted features for automatic concept extraction from digital textbooks. We evaluated and compared the proposed model with several advanced keyphrase extraction models. Most importantly, augmenting earlier research in this direction, we assessed how well our model performs in supporting student modeling, the most critical component of any adaptive textbook. However, this work is limited to a dataset from a single book. It could be extended and evaluated on multiple datasets to examine the general applicability of our proposed framework.

This work is a step towards the ultimate goal of developing a new generation of adaptive textbooks. There is still room to improve the model, for example by focusing on unigram concepts, which currently have the highest error rate. Another direction for work is utilizing deep neural networks to enhance the highly engineered feature sets presented in this work (Alzaidy et al. 2019). Our priority is to investigate how the outcomes of the current model could help improve typical adaptive textbook functionalities such as adaptive navigation support and content recommendation. We also plan to investigate the sensitivities of these “downstream” tasks to the different levels of precision and recall of concept keyword extraction. We believe that the work presented in this paper will help the research community towards building the next generation of learning platforms for the Web.