We describe the development of an automatic tool to assess the readability of text documents. Our readability assessment tool predicts elementary school grade levels of texts with high accuracy. The tool is developed using supervised machine learning techniques on text corpora annotated with grade levels and other indicators of reading difficulty. Various independent variables or features are extracted from texts and used for automatic classification. We systematically explore different feature inventories and evaluate the grade-level prediction of the resulting classifiers. Our evaluation comprises well-known features at various linguistic levels from the existing literature, such as those based on language modeling, part-of-speech, syntactic parse trees, and shallow text properties, including classic readability formulas like the Flesch-Kincaid Grade Level formula. We focus in particular on discourse features, including three novel feature sets based on the density of entities, lexical chains, and coreferential inference, as well as features derived from entity grids. We evaluate and compare these different feature sets in terms of accuracy and mean squared error by cross-validation. Generalization to different corpora or domains is assessed in two ways. First, using two corpora of texts and their manually simplified versions, we evaluate how well our readability assessment tool can discriminate between original and simplified texts. Second, we measure the correlation between grade levels predicted by our tool, expert ratings of text difficulty, and estimated latent difficulty derived from experiments involving adult participants with mild intellectual disabilities. The applications of this work include selection of reading material tailored to varying proficiency levels, ranking of documents by reading difficulty, and automatic document summarization and text simplification.
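For readers unfamiliar with the classic readability formulas mentioned above, the following is a minimal sketch of the Flesch-Kincaid Grade Level computation, using the standard published coefficients. It is not the authors' implementation. The helper names `count_syllables` and `flesch_kincaid_grade` are hypothetical, and the vowel-group syllable counter and regex-based sentence splitter are simplifying assumptions that a real system would likely replace with a pronunciation dictionary and a proper tokenizer.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of consecutive vowels (heuristic assumption)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    # Naive sentence split on terminal punctuation (assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenization: runs of letters and apostrophes (assumption).
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


if __name__ == "__main__":
    easy = "The cat sat on the mat."
    hard = ("Comprehensive readability assessment necessitates "
            "sophisticated computational methodologies.")
    print(flesch_kincaid_grade(easy))
    print(flesch_kincaid_grade(hard))
```

A score like this depends only on sentence length and syllable counts, which is why it serves here as one shallow feature among many rather than as a complete measure of reading difficulty.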