FR3030809A1

FR3030809A1 - METHOD FOR AUTOMATICALLY ANALYZING THE LITERARY QUALITY OF A TEXT

Info

Publication number: FR3030809A1
Application number: FR1463074A
Authority: FR
Inventors: Quentin Pleple
Original assignee: Shortedition
Current assignee: Shortedition
Priority date: 2014-12-22
Filing date: 2014-12-22
Publication date: 2016-06-24
Also published as: FR3030810A1; FR3030811A1; FR3030812A1

Abstract

The invention relates to a method for analyzing the literary quality of a text, implemented by a computer program.

Description

METHOD FOR AUTOMATICALLY ANALYZING THE LITERARY QUALITY OF A TEXT

Technical field

The present invention relates to a method for automatically analyzing the literary quality of a text. In the context of the invention, "literary quality of a text" means the literary quality that is intrinsic to the text. This intrinsic quality can be expressed as a discrete score from 1 to 10, as a continuous score in [0, 1], as a real-valued score in (-∞, +∞), or as labels such as "very good", "good", "average", etc.

In the context of the invention, "classification" has its usual meaning in the art, namely predicting the class (a discrete value) of a data set from a pre-labeled training set. Likewise, "regression" has its usual meaning in the art, namely predicting continuous numerical values associated with a data set from a training set.

State of the art

In general, the purpose of automatic text categorization is to teach a computer to place a text in the correct category based on its content.

Categorization algorithms can solve a variety of text categorization problems. As regards analyzing the quality of a literary or scientific text, various approaches have already been tried and various categorization algorithms implemented.

Several works thus address the quality of a literary text, but most are not relevant because they define quality in a sense of their own, one that is therefore not truly independent of the factors chosen. One example is patent US7200606, in which quality is understood as relevance to a user query.

One of the relevant approaches is the so-called intrinsic approach, which uses categorization algorithms to classify documents according to textual characteristics (indicators) that are intrinsic to the text: composition, elements of style, precision of the vocabulary with respect to a subject, construction of the reasoning, spelling, etc. The sorting characteristics draw on widely varying orthographic, lexical and stylistic approaches, including word length, regularity of the vocabulary, co-occurrence analysis, use of punctuation, detection of grammatical and spelling errors, readability, lexical links with a theme or genre, etc. These text-related characteristics can usefully be supplemented by semantic methods built around the relationship between quality and compliance with spelling and typographic rules, grammar (quality measured on long n-grams), capitalization, text density (ratio of letters to spaces) or entropy (at the word level, or even the character level). Lexicometry, a method for the quantitative analysis of texts, can paradoxically prove a useful tool for measuring quality or the lack of it.

Whatever the methods and categorization algorithms chosen, the main difficulty lies in choosing the indicators and the algorithm, and in combining them to evaluate the literary quality of a text. Little of the literature addresses the literary quality of a text through an intrinsic approach. Publications [1] and [2] describe extracting intrinsic indicators from a raw literary text, followed by a regression or a classification to reach the target value sought. The choice of indicators remains fairly rudimentary, which prevents the quality analysis from being refined to high precision. Publication [3] discloses a prediction of quality from a small number of newspaper articles (from the Wall Street Journal). The analysis in that publication remains basic, since it only establishes a correlation between each indicator and a target value, computed over about thirty reference articles.

There is therefore a need to improve the analysis of the literary quality of a text, in particular to achieve better accuracy.

The aim of the invention is to meet this need at least in part.

Disclosure of the invention

To this end, the subject of the invention is a method for analyzing the literary quality of a text, implemented by a computer program, comprising the following steps:

a/ receiving a raw text to be analyzed;

b/ generating several vector sub-representations of the received text to obtain indicators, called low-level indicators, the sub-representations consisting of:
- a bag-of-words representation, in which the distribution of each word is analyzed, together with the distributions of certain unigrams, bigrams, 3-grams, 4-grams, 5-grams and 6-grams at the word and character levels,
- a so-called morphosyntactic structure representation, in which the parameters of the distributions of grammatical words in the text are computed and the distributions of each syntactic function in the text, paragraphs, sentences and clauses are analyzed,
- a representation of writing errors, in which the number of times each rule of each category of writing error is violated is computed,
- a stylometry representation, in which the following are computed: the length of the text, the length of the paragraphs, the length of the sentences, the length of the clauses, the length of the words in characters, the count of each punctuation mark, and finally the parameters of the distribution of dialogue in the text;

c/ generating:
- a meta-description, in which the vocabulary of the text is analyzed in terms of the different levels of word rarity, the lexical fields used and the words suited to young readers, and in which aggregations and ratios of the low-level indicators obtained in b/ are computed,
- a representation of the lexical fields present in the text, derived from the bag-of-words representation produced in b/ by principal component analysis (PCA) and/or latent semantic analysis (LSA) and/or non-negative matrix factorization (NMF);

d/ concatenating the vector sub-representations generated in b/ and c/ into a final representation of the text;

e/ submitting the final vector of the text to a classifier whose classification algorithm(s) is (are) trained on a set of reference texts whose literary quality has been scored numerically by a population of individuals, preferably a population of experts, and then weighted.

The numerical scoring of the reference texts, which constitutes the learning phase of the method according to the invention, can proceed as follows.

Each individual in a given population, preferably an expert member of an editorial committee, gives series of scores to a given literary text, which are recorded in a database. The series of scores are given in several categories of text, including short stories, poems and comic strips.

The series of scores for each individual and in each text category are then weighted by being centered and then scaled according to the equation:

x' = (x - m) / s

where:
- x is the score between 1 and 10 given by an individual M for a work in a given category C,
- m is the mean of the scores given by M in category C,
- s is the standard deviation of the scores given by M in category C,
- x' is the new corrected score.

Thus, in the learning phase, the vector representation of every text in the database is built by carrying out steps a/ to d/, and the classifier is trained to reproduce the target value sought, that is, the literary quality.

After a thorough search of the state of the art, the inventor concluded that no substantial work had really been done on the literary analysis of a text.
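The centering and scaling of scores per individual and per category, x' = (x - m) / s, can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `standardize_scores`, the tuple layout of the input, the use of the population standard deviation and the fallback to 0.0 when an individual gave identical scores are all assumptions that the text does not specify.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_scores(ratings):
    """Center and scale scores per (rater, category) group.

    ratings: list of (rater, category, text_id, score) tuples
             (hypothetical layout, chosen for this sketch).
    Returns a list of (rater, category, text_id, corrected_score)
    in the same order as the input.
    """
    # Group the raw scores x by individual M and category C.
    groups = defaultdict(list)
    for rater, category, _text_id, score in ratings:
        groups[(rater, category)].append(score)

    # m = mean, s = standard deviation of each group.
    # Assumption: population standard deviation (pstdev).
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}

    corrected = []
    for rater, category, text_id, score in ratings:
        m, s = stats[(rater, category)]
        # Assumption: if s == 0 (all scores identical), use 0.0
        # instead of dividing by zero.
        x_prime = (score - m) / s if s > 0 else 0.0
        corrected.append((rater, category, text_id, x_prime))
    return corrected
```

Applied to an individual who systematically gives high scores, this correction removes that bias, so that corrected scores from different individuals become comparable within a category.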

Starting from the observation that the few known methods for analyzing literary quality were not precise enough, the inventor further determined that this was mainly due to the simplicity of the indicators extracted from each text.

The inventor then had the idea of combining different initial representations to obtain more complex indicators, and then concatenating these initial representations so as to obtain a final representation more complete than those of the state of the art.
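The idea of building several sub-representations and concatenating them into one vector can be sketched as follows. Only two of the sub-representations are illustrated (word n-gram counts and a few stylometric measures). All names (`tokenize`, `ngrams`, `bag_of_ngrams`, `stylometry`, `final_representation`), the regular expressions and the fixed `vocabulary` argument are hypothetical choices for this sketch, not details taken from the invention.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a simple regular expression stands in for the tokenizer.
    return re.findall(r"\w+", text.lower())


def ngrams(items, n):
    # All contiguous n-grams of a sequence, as tuples.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def bag_of_ngrams(text, vocabulary, n=1):
    # Count how often each n-gram of a fixed vocabulary occurs in the text.
    counts = Counter(ngrams(tokenize(text), n))
    return [float(counts[gram]) for gram in vocabulary]


def stylometry(text):
    # A few length and punctuation measures (a small, illustrative subset).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    return [
        float(len(words)),                                           # text length in words
        float(len(sentences)),                                       # text length in sentences
        len(words) / len(sentences) if sentences else 0.0,           # mean sentence length
        sum(len(w) for w in words) / len(words) if words else 0.0,   # mean word length
        float(text.count(",")),                                      # number of commas
    ]


def final_representation(text, vocabulary):
    # Concatenate the sub-representations into a single vector.
    return bag_of_ngrams(text, vocabulary, n=1) + stylometry(text)


vector = final_representation(
    "The cat sat. The cat ran, fast!",
    vocabulary=[("the",), ("cat",), ("dog",)],
)
```

Because the final vector is a plain concatenation, further sub-representations (morphosyntactic structure, writing errors, meta-description, lexical fields) could be appended in the same way without changing the downstream classifier interface.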

Through the choice of complex indicators, the invention thus allows a more exhaustive measurement of the intrinsic literary quality of a text. The literary quality analysis obtained with the invention is all the more precise in that the inventor was able to build a large, reliable database containing a high number of texts evaluated by individuals and of series of scores awarded. A major advantage of the invention is that it can process all literary texts with the same quality requirements and complete the analysis in the same time frame whatever the text.

Preferably, the analysis of the distributions of each word carried out in b/ is performed after natural language processing (NLP; in French TALN, "Traitement Automatique de la Langue Naturelle").

According to an advantageous embodiment, the parameters of the distribution of the grammatical words and of the dialogue in the text are computed from criteria chosen from the mean, the variance, the standard deviation, the entropy, the distance between distributions, or a combination thereof.
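The distribution parameters listed above (mean, variance, standard deviation, entropy, distance between distributions) can be computed as in the sketch below. The function names are hypothetical, the entropy is taken as the Shannon entropy in bits of the empirical distribution, and the total variation distance is used as one possible distance between distributions; the text does not say which entropy or distance is meant, so both are assumptions.

```python
import math
from collections import Counter
from statistics import mean, pvariance


def distribution_parameters(values):
    """Summary parameters of a non-empty sample of discrete values,
    e.g. the number of grammatical words in each sentence."""
    m = mean(values)
    var = pvariance(values, m)  # population variance around m
    counts = Counter(values)
    total = len(values)
    # Shannon entropy (in bits) of the empirical distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {"mean": m, "variance": var, "std": math.sqrt(var), "entropy": entropy}


def total_variation_distance(sample_a, sample_b):
    """Distance between the empirical distributions of two non-empty samples.
    Assumption: total variation distance is one possible choice."""
    pa, pb = Counter(sample_a), Counter(sample_b)
    support = set(pa) | set(pb)
    return 0.5 * sum(
        abs(pa[v] / len(sample_a) - pb[v] / len(sample_b)) for v in support
    )


params = distribution_parameters([2, 2, 4, 4])
distance = total_variation_distance([1, 1, 2, 2], [1, 1, 1, 1])
```

Each returned parameter can then be used as one component of the corresponding sub-representation vector.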

The method advantageously comprises applying one or more dimensionality reduction algorithms to the lexical fields obtained in c/.

According to a first variant embodiment, the classification algorithm(s) is (are) adapted to perform a binary classification of the text according to its quality.

According to a second variant embodiment, the classification algorithm(s) is (are) adapted to perform a multi-class classification of the text according to its quality.

The invention also relates to a computer program for implementing the method described above.

Detailed description

Other advantages and features of the invention will become clearer on reading the detailed description of exemplary implementations of the invention, given by way of non-limiting illustration with reference to the single appended figure, which is a flowchart of the steps of the method according to the invention as implemented by a computer program.

In what follows, the terms "algorithm" and "computer program" are used interchangeably, the latter being the computer-readable encoding of the algorithm.

An algorithm is thus an execution plan for a computer. The computer takes input data, applies the processing described by the algorithm and returns a result to the user. In the context of the invention, the algorithm used for the predictive analysis is a machine learning algorithm. In this type of algorithm, the decision rules are not fixed at design time: the algorithm is designed to modify its decision rules according to the data it sees.

The method according to the invention comprises two successive phases, the first being a learning phase and the second a prediction phase.

The learning phase is carried out first. A population of experts grouped into an editorial committee gives a series of scores for each literary text in a set that will serve as the set of reference texts. These scores measure the literary quality of each text, and they are weighted by being centered and then scaled according to the equation:

x' = (x - m) / s

where:
- x is the score between 1 and 10 given by an individual M for a work in a given category C,
- m is the mean of the scores given by M in category C,
- s is the standard deviation of the scores given by M in category C,
- x' is the new corrected score.

The vector representation of each previously evaluated literary text is then built. After several algorithms have been trained on the data, that is, after several algorithms have learned their parameters, the best-performing one is selected.

The hyper-parameters of the algorithm are then estimated by cross-validation. Once the hyper-parameters have been set, the algorithm is trained again on the full set of vector representations of all the previously evaluated texts.
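Estimating a hyper-parameter by cross-validation can be sketched as follows. The text does not name the learning algorithm, so a k-nearest-neighbors regressor is used here purely as a stand-in, with the number of neighbors k as the hyper-parameter; the function names, the interleaved fold assignment and the mean squared error criterion are likewise assumptions made for this illustration.

```python
def knn_predict(train_X, train_y, x, k):
    # Stand-in model: average the targets of the k nearest training vectors
    # (squared Euclidean distance).
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), y)
        for row, y in zip(train_X, train_y)
    )
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cross_val_mse(X, y, k, n_folds=5):
    # Mean squared error of the model over n_folds held-out folds.
    n = len(X)
    errors = []
    for fold in range(n_folds):
        # Assumption: interleaved folds (every n_folds-th sample).
        test_idx = set(range(fold, n, n_folds))
        train_X = [X[i] for i in range(n) if i not in test_idx]
        train_y = [y[i] for i in range(n) if i not in test_idx]
        for i in sorted(test_idx):
            pred = knn_predict(train_X, train_y, X[i], k)
            errors.append((pred - y[i]) ** 2)
    return sum(errors) / len(errors)


def select_k(X, y, candidates=(1, 3, 5), n_folds=5):
    # Keep the hyper-parameter value with the lowest cross-validated error.
    scores = {k: cross_val_mse(X, y, k, n_folds) for k in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Toy data: one-dimensional "text vectors" and their corrected scores.
X = [[float(i)] for i in range(20)]
y = [2.0 * i for i in range(20)]
best_k, cv_scores = select_k(X, y)
```

Once the value is selected, the final model simply uses all of `X` and `y` as training data, which mirrors the retraining on the full set of evaluated texts.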

For the prediction phase proper, applied to a new literary text, its vector representation is first built. The resulting vector is then submitted to the trained algorithm, which predicts a value for the literary quality of the new text together with an estimate of the error. In other words, the vector of the text is submitted to a classifier.

On décrit maintenant plus en détail, la construction de la représentation vectorielle de chaque texte. L'algorithme réalise les étapes suivantes, à partir d'un texte brut a analyser (étape SO). Il génère plusieurs sous-représentations vectorielles (1) du texte reçu pour obtenir des indicateurs bas-niveau. La première sous-représentation consiste en une représentation par sac de mots (étape Si) selon laquelle on analyse les distributions de chaque mot et on analyse les distributions de certains unigrams, bi-grams, 3-grams, 4-grams, 5-grams et 6-grams à l'échelle du mot et des caractères. Ainsi, dans cette étape Si, le texte est transformé en une suite de tokens selon des expressions régulières de découpage. La représentation par sac- de-mots ne tient pas compte de la mise en forme du texte, de l'ordre des mots, de leur sens ou des relations structurées par des mots de liaison. La deuxième sous-représentation représente la structure morphosyntaxique (étape S2), selon laquelle on calcule les paramètres des distributions des mots grammaticaux dans le texte et on analyse les distributions de chaque fonction syntaxique dans le texte, les paragraphes, les phrases et les propositions. Les mots grammaticaux sont les articles, les prépositions, les adjectifs non qualificatifs. Le calcul des paramètres de la distribution des mots grammaticaux est fait à partir de critères choisis parmi la moyenne, la variance, l'écart type, l'entropie, la distance entre les distributions ou une combinaison de ceux-ci. Une fonction syntaxique est un verbe, un nom, un adjectif, un adverbe, un déterminant, une préposition. Ainsi, cette étape S2 permet d'extraire des éléments de structure du texte dans pour autant monter jusqu'au niveau pragmatique de la compréhension générale du texte. La troisième sous-représentation représente des fautes d'écriture (étape S3) selon laquelle on calcule le nombre de fois où chaque règle de chacune des catégories de fautes d'écriture n'est pas respectée. 
Les fautes d'écriture sont les fautes d'orthographe, de grammaire, de conjugaison, d'anglicisme, de syntaxe, d'expression, et d'usage. Ainsi, cette étape S3 consiste à analyser automatiquement les différents types de fautes apparaissant dans le texte. La quatrième sous-représentation représente la stylométrie (étape S4) selon laquelle on calcule la longueur du texte, la longueur des paragraphes, la longueur des phrases, la longueur des propositions, la longueur des mots en caractères, le nombre de chaque signe de ponctuation, et enfin les paramètres de la distribution des dialogues dans le texte. La longueur du texte est calculée à partir du nombre de paragraphes, phrases, propositions, mots, caractères. La longueur d'un paragraphe est calculée à partir du nombre de phrases, propositions, mots, caractères. La longueur des phrases est calculée à partir du nombre de propositions, mots, caractères. La longueur des propositions est calculée à partir du nombre de mots, caractères. Le calcul des paramètres de la distribution des dialogues dans le texte est fait à partir de critères choisis parmi la moyenne, la variance, l'écart type, l'entropie, la distance entre les distributions ou une combinaison de ceux-ci. Ainsi, cette étape S4 permet d'identifier le style du texte. A partir de toutes les sous-représentations précédentes, l'algorithme génère une cinquième sous-représentation qui est une méta-description (étape S5) selon laquelle on analyse le vocabulaire du texte par les différents niveaux de rareté des mots, les champs lexicaux utilisés, les mots adaptés à la jeunesse, et on calcule des agrégations (sommes) et ratios (divisions) des indicateurs bas-niveau obtenus précédemment. 
On donne ci-après un exemple d'agrégation calculé à partir d'indicateurs bas niveau qui sont les suivants: - NIN = nombre de verbes à l'infinitif - NPR = nombre de verbes au présent - NFU = nombre de verbes au futur - NPA = nombre de verbes au passé.The construction of the vector representation of each text is now described in more detail. The algorithm performs the following steps, from a raw text to be analyzed (step SO). It generates several vector sub-representations (1) of the received text to obtain low-level indicators. The first under-representation consists of a word bag representation (step Si) in which the distributions of each word are analyzed and the distributions of certain unigrams, bi-grams, 3-grams, 4-grams, 5-grams are analyzed. and 6-grams at the scale of the word and characters. Thus, in this step Si, the text is transformed into a sequence of tokens according to regular expressions of division. Word-bag representation does not take into account text formatting, word order, meaning, or relationships structured by linking words. The second underrepresentation represents the morphosyntactic structure (step S2), in which the parameters of the distributions of the grammatical words in the text are calculated and the distributions of each syntactic function are analyzed in the text, paragraphs, sentences and propositions. Grammatical words are articles, prepositions, non-qualifying adjectives. The calculation of the parameters of the distribution of the grammatical words is made from criteria chosen from the mean, the variance, the standard deviation, the entropy, the distance between the distributions or a combination of these. A syntactic function is a verb, a noun, an adjective, an adverb, a determinant, a preposition. Thus, this step S2 makes it possible to extract structural elements of the text in so far as to go up to the pragmatic level of the general comprehension of the text. 
The third sub-representation represents writing errors (step S3), in which the number of times each rule of each category of writing error is violated is counted. Writing errors are errors of spelling, grammar, conjugation, anglicism, syntax, expression and usage. This step S3 thus consists of automatically analyzing the different types of errors appearing in the text. The fourth sub-representation represents stylometry (step S4), in which the following are calculated: the length of the text, the length of the paragraphs, the length of the sentences, the length of the clauses, the length of the words in characters, the number of occurrences of each punctuation mark, and finally the parameters of the distribution of dialogues in the text. The length of the text is calculated from the number of paragraphs, sentences, clauses, words and characters. The length of a paragraph is calculated from the number of sentences, clauses, words and characters. The length of a sentence is calculated from the number of clauses, words and characters. The length of a clause is calculated from the number of words and characters. The parameters of the distribution of dialogues in the text are calculated from criteria chosen from the mean, the variance, the standard deviation, the entropy, the distance between distributions, or a combination of these. This step S4 thus makes it possible to identify the style of the text. From all the preceding sub-representations, the algorithm generates a fifth sub-representation, which is a meta-description (step S5) in which the vocabulary of the text is analyzed through the different levels of word rarity, the lexical fields used and the words suited to young readers, and in which aggregations (sums) and ratios (divisions) of the previously obtained low-level indicators are calculated.
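A few of the stylometric indicators of step S4 can be sketched as below. This is a simplified illustration under stated assumptions: paragraphs are taken to be separated by blank lines, sentences to end with '.', '!' or '?', and the function name `stylometry` is hypothetical. Clause segmentation and the distribution of dialogues are not covered, and the function expects a non-empty text.

```python
import re
import statistics
from collections import Counter

WORD = re.compile(r"[^\W_]+")

def stylometry(text):
    """Return a few stylometric indicators for a non-empty raw text."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = WORD.findall(text)
    sentence_lengths = [len(WORD.findall(s)) for s in sentences]
    punctuation = Counter(ch for ch in text if ch in ".,;:!?\"'()-")
    return {
        "n_paragraphs": len(paragraphs),
        "n_sentences": len(sentences),
        "n_words": len(words),
        "n_characters": len(text),
        "mean_word_length": statistics.mean(len(w) for w in words),
        "mean_sentence_length": statistics.mean(sentence_lengths),
        "sentence_length_stdev": statistics.pstdev(sentence_lengths),
        "punctuation": punctuation,
    }

sample = "It rained. She waited, alone.\n\nThen the door opened!"
indicators = stylometry(sample)
```

Each entry of the returned dictionary would become one component (or, for the punctuation counts, several components) of the stylometry sub-representation.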
An example of an aggregation is given below, calculated from the following low-level indicators:
- NIN = number of verbs in the infinitive,
- NPR = number of verbs in the present tense,
- NFU = number of verbs in the future tense,
- NPA = number of verbs in the past tense.

The computed aggregation gives an intermediate-level indicator NV, which is the total number of verbs: NV = NIN + NPR + NFU + NPA. An example of a ratio is given below, calculated from the following low-level indicators:
- NP = number of sentences,
- NV = number of verbs.
The computed ratio gives an intermediate-level indicator NM, which is the average number of verbs per sentence: NM = NV / NP. This step S5 thus makes it possible to obtain meta-descriptions such as readability, breadth of vocabulary or lexical cohesion.

From the bag-of-words sub-representation, the algorithm generates a sixth sub-representation which represents the lexical fields present in the text (step S6), by principal component analysis (PCA) and/or latent semantic analysis (LSA) and/or non-negative matrix factorization (NMF). This is therefore a dimensionality reduction step used to obtain lexical fields. When these three analyses yield too many lexical fields, the algorithm performs an additional dimensionality reduction step (step S7). Step S7 therefore consists of pooling all the lexical fields and keeping only a limited number of them, so that those kept are unique and relevant. In other words, if the components of the vector generated in step S6 are redundant, step S7 selects the non-redundant components of the vector. Once all the vector sub-representations have been generated, the algorithm concatenates them (step S8) into a final representation of the text.

Many variants and improvements can be envisaged without departing from the scope of the invention. Thus, different ranking variants can be envisaged depending on the intended use.
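The aggregation NV, the ratio NM and the concatenation of step S8 described above can be sketched as follows. The function names, the guard against a text with zero sentences and the numeric values are assumptions chosen for the example; steps S6 and S7 are not shown.

```python
def meta_description(low_level):
    """Derive intermediate-level indicators from low-level counts (step S5)."""
    # Aggregation (sum): total number of verbs.
    nv = low_level["NIN"] + low_level["NPR"] + low_level["NFU"] + low_level["NPA"]
    # Ratio (division): average number of verbs per sentence.
    n_sentences = low_level["NP"]
    nm = nv / n_sentences if n_sentences else 0.0  # assumed convention for NP = 0
    return {"NV": nv, "NM": nm}

def concatenate(*sub_representations):
    """Concatenate sub-representation vectors into one final vector (step S8)."""
    final = []
    for vector in sub_representations:
        final.extend(vector)
    return final

counts = {"NIN": 4, "NPR": 10, "NFU": 1, "NPA": 5, "NP": 8}
meta = meta_description(counts)

# Placeholder vectors standing in for two other sub-representations.
bag_of_words = [0.2, 0.0, 0.8]
stylometric = [3.0, 12.5]
final_vector = concatenate(bag_of_words, stylometric, [meta["NV"], meta["NM"]])
```

With these counts, NV is 20 and NM is 2.5, and the final vector simply places the components of each sub-representation one after another.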

The works can thus be ranked in order of literary quality rather than each being given a precise quality score. The aim is then to position the texts of a set relative to one another.

Ranking can be reduced to a regression or a classification. The ranking can be derived from the regression problem simply by ordering the texts according to the score assigned by the program. Another approach is to derive the ranking from the classification problem: in the same way as for regression, the texts can be ordered by their probability of belonging to a class. There are also algorithms that directly optimize the ranking of objects. These algorithms can be applied to the vector representation of texts according to the invention.
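The two reductions described above amount to sorting the texts by a predicted value. A minimal sketch, assuming the regression scores and class-membership probabilities have already been produced by some trained model (the identifiers and values below are invented for the example):

```python
def rank_by_value(values):
    """Return text identifiers ordered from highest to lowest predicted value."""
    return sorted(values, key=values.get, reverse=True)

# Hypothetical outputs of a regressor (score on a 1-10 scale).
predicted_scores = {"text_a": 6.1, "text_b": 8.4, "text_c": 3.7}
# Hypothetical outputs of a binary classifier (probability of the "good" class).
prob_good = {"text_a": 0.55, "text_b": 0.91, "text_c": 0.12}

ranking_from_regression = rank_by_value(predicted_scores)
ranking_from_classification = rank_by_value(prob_good)
```

In this example both reductions happen to give the same order, but nothing guarantees that in general; algorithms that optimize the ranking directly are not covered by this sketch.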

As a variant, a binary classification of texts according to quality can be envisaged. A multi-class classification, which is an extension of binary classification, can also be envisaged. In that case the texts are no longer divided into two subsets but into n subsets (n > 2). One may, for example, seek to form four equal quality subsets: the very good, the good, the mediocre and the bad. An advantage of this classification is that the probabilities of belonging to each of the classes can be calculated, which gives a distribution over the space of scores if, for example, the scores are discrete and range from 1 to 10. A distribution over the scores gives more information than a simple regression.
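To illustrate why a distribution over the scores carries more information than a single regressed value, the sketch below compares two invented distributions over scores 1 to 10 that share the same mean but have different shapes. The function names and the probability values are assumptions for this example, and the mode count is a deliberately simple heuristic (strict local maxima).

```python
import math

def summarize(distribution):
    """Return (mean, standard deviation) of a {score: probability} mapping."""
    mean = sum(s * p for s, p in distribution.items())
    variance = sum(p * (s - mean) ** 2 for s, p in distribution.items())
    return mean, math.sqrt(variance)

def count_modes(distribution):
    """Count strict local maxima along the ordered scores."""
    scores = sorted(distribution)
    probs = [distribution[s] for s in scores]
    modes = 0
    for i, p in enumerate(probs):
        left = probs[i - 1] if i > 0 else -1.0
        right = probs[i + 1] if i < len(probs) - 1 else -1.0
        if p > left and p > right:
            modes += 1
    return modes

# Invented example distributions over the discrete scores 1..10.
flat = {s: 0.1 for s in range(1, 11)}
polarizing = {1: 0.05, 2: 0.25, 3: 0.15, 4: 0.05, 5: 0.0,
              6: 0.0, 7: 0.05, 8: 0.15, 9: 0.25, 10: 0.05}

flat_mean, flat_std = summarize(flat)
pol_mean, pol_std = summarize(polarizing)
```

Both distributions have a mean of 5.5, so a regression would give them the same score, whereas `count_modes` finds no peak in the first and two peaks in the second.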

Thus, learning the distribution for a particular work makes it possible to answer finer questions about the text, such as:
- distinguishing a constant (flat) distribution from a two-mode (camel-back) distribution, which divides people and creates controversy,
- assessing the probability that the text is very good or very bad,
- recovering the values found by regression: the mean and the standard deviation of the distribution.
According to yet another variant, one may seek not to predict universal literary quality (or the distribution over a space of scores), but rather to personalize the prediction. It is thus conceivable to group the annotators into broad types of identified profiles, possibly by data clustering or by hand, and to train the algorithm according to the invention for each identified group of readers. The major advantage of this variant is that it reduces the noise in the target value to be predicted, namely the literary quality of a text.
If the annotators are grouped by profile, groups of users are created who generally reach a consensus on the texts. Creating one model per group of annotators and clearly identifying their profiles then makes it possible to make different predictions depending on the intended use of the prediction. If the prediction is intended for an identified person, one can first identify the group to which that person belongs and then make the prediction with the literary quality analysis of that group.

REFERENCES CITED
[1]: "DEFT2014, analyse automatique de textes littéraires et scientifiques en langue française", Lecluze et al., 21ème Traitement Automatique des Langues Naturelles, Marseille, 2014;
[2]: "Catégorisation sémantique fine des expressions d'opinion pour la détection de consensus", Benamara et al., 21ème Traitement Automatique des Langues Naturelles, Marseille, 2014;
[3]: "Revisiting Readability: A Unified Framework for Predicting Text Quality", Pitler et al., in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), Association for Computational Linguistics, Stroudsburg, PA, USA, 186-195.

Claims

1. A method for analyzing the literary quality of a text, implemented by a (micro)processor of a computer, comprising the following steps:
a/ receiving a raw text to be analyzed (step S0);
b/ generating several vector sub-representations (1) of the received text in order to obtain indicators, called low-level indicators, the sub-representations consisting of:
- a bag-of-words representation (step S1), in which the distributions of each word are analyzed, along with the distributions of certain unigrams, bigrams, 3-grams, 4-grams, 5-grams and 6-grams at word and character level,
- a representation called morphosyntactic structure (step S2), in which the parameters of the distributions of grammatical words in the text are calculated and the distributions of each syntactic function in the text, the paragraphs, the sentences and the clauses are analyzed,
- a representation of writing errors (step S3), in which the number of times each rule of each category of writing error is violated is calculated,
- a stylometry representation (step S4), in which the length of the text, the length of the paragraphs, the length of the sentences, the length of the clauses, the length of the words in characters, the number of each punctuation mark, and finally the parameters of the distribution of dialogues in the text are calculated;
c/ generating (2):
- a meta-description (step S5), in which the vocabulary of the text is analyzed through the different levels of word rarity, the lexical fields used and the words suited to young readers, and aggregations and ratios of the low-level indicators obtained in b/ are calculated;
- a representation of the lexical fields (step S6) present in the text, from the bag-of-words representation produced in b/, by principal component analysis (PCA) and/or latent semantic analysis (LSA) and/or non-negative matrix factorization (NMF);
d/ concatenating (step S8) the vector sub-representations generated in b/ and c/ into a final representation of the text;
e/ submitting the final vector of the text to a classifier whose classification algorithm(s) is (are) trained on a set of reference texts whose literary quality has been measured numerically by a population of individuals, preferably a population of experts, and then weighted.

2. Analysis method according to claim 1, wherein the analysis of the distributions of each word performed in b/ is carried out after natural language processing (NLP) treatments.

3. Analysis method according to claim 1 or 2, wherein the parameters of the distributions of grammatical words and of dialogues in the text are calculated from criteria selected from the mean, the variance, the standard deviation, the entropy, the distance between distributions, or a combination of these.

4. Analysis method according to one of claims 1 to 3, comprising the implementation of one or more dimensionality reduction algorithms (step S7) on the lexical fields obtained in c/.

5. Analysis method according to one of the preceding claims, the classification algorithm(s) being adapted to perform a binary classification of the text according to quality.

6. Analysis method according to one of claims 1 to 4, the classification algorithm(s) being adapted to perform a multi-class classification of the text according to quality.

7. A computer program comprising program code instructions for performing the steps of the method according to one of the preceding claims when said program is run on a computer.