User Details
- User Since
- Sep 9 2019, 9:50 AM
- IRC Nick
- mgerlach
- LDAP User
- MGerlach
- MediaWiki User
- MGerlach (WMF) [ Global Accounts ]
Fri, Nov 29
weekly update:
- completed the first full revision of the CfP for this year. Will share with the co-chair/organizers before proceeding
- starting to put together a plan/templates for advertising the CfP after publication
weekly update:
- After successfully implementing the Aya-expanse-32b model, I am generating simple summaries for the set of sample articles from the Web Team's experiments, running the model on our own GPUs on the ML-Lab servers (related to T380643); code: https://gitlab.wikimedia.org/repos/research/simple-summaries/-/blob/main/simple-summary_aya-expanse_experiment01.ipynb
- Looking at the initial results (see spreadsheet), I have detected two issues: i) the simple summaries are not much simpler in terms of readability score; ii) the output is sometimes (~20%) in a different language. Testing different ways of adapting the prompt to mitigate these issues.
weekly update:
- After successfully implementing the Aya-expanse-32b model, I am generating simple summaries for the set of sample articles from the Web Team's experiments, running the model on our own GPUs on the ML-Lab servers
- Drafted documentation of the hypothesis work and added to the meta-page https://meta.wikimedia.org/wiki/Research:Develop_a_model_for_text_simplification_to_improve_readability_of_Wikipedia_articles/FY24-25_WE.3.1.3_content_simplification
Fri, Nov 22
weekly update:
- ongoing coordination about the PC co-chair
- updated submission templates and repo https://gitlab.wikimedia.org/repos/research/wikiworkshop-templates
- Kinneret set up an OpenReview instance for submission https://openreview.net/forum?id=4MwSMQxue6
- revising the CfP (topics, troubleshooting for OpenReview sign-up)
weekly update:
- Coordinated with the Web Team about filtering low-quality simple summaries for experiments. They applied one of the proposed guardrail metrics to ensure factual consistency (meaning preservation) between the original article and the simple summary.
weekly update:
- We switched the model from Aya-23 to its successor, Aya-expanse. https://phabricator.wikimedia.org/T379052#10314444
- I implemented the Aya-expanse models in our internal infrastructure using the new GPUs on the ML-Lab servers. We run the model to generate simple summaries using either the Aya-expanse-8b or the Aya-expanse-32b model.
- I implemented some quantization techniques so that the models would fit into memory (the larger version does not work out of the box) https://www.deeplearning.ai/short-courses/quantization-fundamentals-with-hugging-face/
- Specifically, we can run the model with different datatypes. For example, using float16 instead of the default float32 cuts the memory footprint of the model in half, which in turn reduces inference time. It is generally believed that this comes with little to no decrease in model performance (see the sketch after this list).
- For example, the Aya-expanse-32b model could not be loaded into GPU memory with the default datatype; using float16, its memory footprint is 60.16GB and it fits into the GPU's memory. Similarly, for the smaller Aya-expanse-8b, the footprint decreases from 29.91GB to 14.95GB, cutting the time for a single example query from 8s to 3s.
- We can apply additional quantization techniques to further improve the memory footprint and inference time, using, for example, the quanto library (https://huggingface.co/blog/quanto-introduction). This will require more thorough experiments, not only to understand the different options and the potential dependencies that need to be resolved in LiftWing, but also to make sure that model quality is preserved. I believe this is beyond the scope of the current task and should be scoped as a separate task/hypothesis if we have enough evidence that the model is useful in practice.
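For illustration, a minimal sketch of the float16 loading with Hugging Face transformers; the model id and prompt are placeholders, not necessarily what we run on the ML-Lab servers:

```python
# Minimal sketch: load Aya-expanse in half precision with Hugging Face
# transformers. torch_dtype=torch.float16 halves the memory footprint
# compared to the default float32. Model id and prompt are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/aya-expanse-8b"  # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision instead of float32
    device_map="auto",          # place the model on the available GPU(s)
)

messages = [{"role": "user", "content": "Summarize this text for a fifth-grade student: ..."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```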
Thu, Nov 21
@Jdlrobson I found the following three tables:
Sat, Nov 16
weekly update:
- no updates because I was attending the team offsite during this week
weekly update:
- no updates because I was attending the team offsite during this week
weekly update:
- no updates because I was attending the team offsite during this week
Fri, Nov 8
weekly update:
- Started work with the ML Team on a dedicated subtask to test-deploy Aya models on LiftWing T379052
- While we have successfully tested the smaller Aya-23-8b model, we have not been able to run the larger Aya-23-35b model, as it requires more memory than is available on our GPUs. We are therefore testing the next generation of Aya models (Aya-expanse) for test-deployment: they have a smaller memory footprint and thus might be easier to run in our infrastructure; they are reportedly strictly better than the previous Aya-23 (so we would probably switch to the newer version in future experiments); and they support the same set of 23 languages.
- We ran the first successful experiments with the larger Aya-expanse-32b model on the ML-Lab machines, where we were able to load the model and run inference.
weekly update:
- Coordinated the timeline with Kinneret (making sure it aligns with the Research Fund and other Wikimedia Research events) and revised it slightly to accommodate that
- Drafted first version of updated CfP
- Started revising the submission templates
- Coordinated with Kinneret about creating an OpenReview instance in order to get a submission link that we can mention in the CfP
- Coordinated with Kinneret about updates to the Privacy Policy
- The main blocker for proceeding with the CfP now is that I am waiting for confirmation of the co-PC chair to finalize the timeline and the draft of the CfP
weekly updates:
- shared revised set of guardrail metrics for simple summaries with Web Team (googlesheet).
- most of the summaries are substantially simpler than the original and have relatively few grammatical issues.
- most importantly, the meaning-preservation metric (SummaC) seems very useful for filtering simple summaries that are not consistent with the original (e.g. error messages or text that was not contained in the original article). the simple summaries with very low scores should be discarded for the first set of experiments, as lower recall is not an issue (see the sketch after this list).
- these guardrail metrics thus offer an option to filter out potentially low-quality simple summaries.
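For illustration, a sketch of what such a filter could look like with the SummaC package (https://github.com/tingofurro/summac); the threshold here is a hypothetical placeholder, in practice it would be chosen by inspecting scored samples:

```python
from summac.model_summac import SummaCConv

# NLI-based consistency scorer; the arguments follow the SummaC README.
model = SummaCConv(models=["vitc"], granularity="sentence", device="cpu")

def keep_summary(original: str, summary: str, threshold: float = 0.1) -> bool:
    # Very low scores indicate the summary contains content that is not
    # supported by the original article (e.g. error messages); discard those.
    score = model.score([original], [summary])["scores"][0]
    return score >= threshold
```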
@isarantopoulos Thanks for the updates
Nov 5 2024
Update: @isarantopoulos did first experiments with the Aya-23-35B model. It does not work out of the box: the raw version of the model is 65GB on disk and does not fit into memory. We will explore potential workarounds using a quantized model, e.g. using the int8 datatype, to reduce the footprint so that it is compatible with our infrastructure (see the sketch below).
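For illustration, a hedged sketch of the int8 idea using the quanto library (https://huggingface.co/blog/quanto-introduction); import paths vary across quanto versions and the model id is an assumption:

```python
from transformers import AutoModelForCausalLM
from quanto import quantize, freeze, qint8

# Load the full-precision model, then replace the float weights with int8
# equivalents to reduce the memory footprint.
model = AutoModelForCausalLM.from_pretrained("CohereForAI/aya-23-35B")
quantize(model, weights=qint8)
freeze(model)  # materialize the quantized weights
```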
Update: Aya-23-8B model runs successfully in LiftWing. thanks @isarantopoulos
Nov 1 2024
weekly update:
- Putting together update for public documentation of the model
- Set up meeting with ML Team next week to discuss test-deployment of aya23-35b model (used in the summaries experiment by Web Team) or potential alternative candidates
weekly update:
- Web Team generated simple summaries for a selected list of ~8K articles T375364 using the aya23-35b model from the Cohere API
- I am trying to evaluate the quality of the simple summaries by calculating 3 proxy metrics for simplicity, fluency, and meaning preservation (googledocs sheet). The aim is to identify low-quality summaries that should be filtered. Qualitatively inspecting the scores for meaning preservation indicates that we can identify cases where the summary contains information that is not mentioned in the original article (negative scores or low scores close to 0). Planning to inspect more samples to see whether this approach to filtering makes sense.
weekly update:
- needed to get organized about todos for the Research track
- for the CfP, the first step was to draft a rough timeline of dates (submission deadline, review period, etc.). will consult with the co-organizer and the organizers from the Research Fund etc.
- next step: revising the CfP from last year.
Oct 29 2024
Oct 28 2024
I would like to ask for adding 2 new papers to the landing page.
Oct 25 2024
weekly update:
- no update this week (no immediate asks for support this week)
weekly update:
- Ongoing work by ML Team to test-deploy the Aya-23 35B version of the model in LiftWing. Due to the size of the model, this requires some workarounds with datatypes which affects package dependencies.
- Selected and implemented three interpretable metrics to automatically check the quality of the automatically generated simple summaries. This will serve as a tool to make informed decisions about whether the simple summaries meet minimum quality requirements before considering their use in practice, or whether they should be discarded. These metrics capture: i) simplicity (is the model output simpler to read than the original?); ii) fluency (is the model output grammatically correct?); iii) meaning preservation (is the model output factually consistent with the original text?).
- Example notebook: https://gitlab.wikimedia.org/repos/research/text-simplification/-/blob/main/evaluate_simple-summaries.ipynb
Oct 24 2024
Oct 22 2024
@Prototyperspective thanks for reaching out.
In principle, yes. One model generates simplified versions of articles. It is trained on pairs of the same article existing in both English Wikipedia and Simple English Wikipedia (dataset). One could use the model to generate simplified versions of articles that do not yet exist in Simple English Wikipedia but already exist in English Wikipedia (of which there are many). However, it currently considers only the plain text of the article and thus will not include any links or references (which would be crucial for a draft of an actual article). Please note that this is exploratory research to assess the feasibility of such a model.
Oct 18 2024
weekly update:
- I have been mostly busy this week with catching up on what has happened over the past 2 months while I was out.
- I have been coordinating the work with ML team to deploy the larger 35B version of the Aya-23 model in a test-instance of LiftWing (the smaller 8B version was already tested successfully and thus constitutes a valid backup solution in case the former does not work)
weekly update:
- adapted the title to reflect that the work is continuing in Q2
- spent most of my time catching up on the current needs of the corresponding hypothesis owners
- the Web Team started small user tests using the model for generating simple summaries of sections (documentation) that I had prepared before leaving for sabbatical (T374638). Feedback from the small sample of users is very positive (report)
- I have been syncing with Jan about next experiments to generate simple summaries for the lead sections of 10K articles for use in larger experiments in the browser extension (T375364)
- We also started discussions about usage of the https://vector-search.wmcloud.org/ endpoint in recommendation experiments (T374669). These are currently small-scale experiments, but there are some questions about how this could be scaled when potentially used in larger experiments.
Aug 16 2024
weekly update
- no updates this week as I was attending ACL 2024 conference.
weekly update
- no updates this week as I was attending ACL 2024 conference.
Aug 8 2024
weekly update:
- no updates this week as there were no requests for additional support so far
weekly update:
- Updated the project page on meta with current status: https://meta.wikimedia.org/wiki/Research:Develop_a_model_for_text_simplification_to_improve_readability_of_Wikipedia_articles/FY24-25_WE.3.1.3_content_simplification
- Identified informative/interpretable metrics to evaluate the performance of simplification models. For manual evaluation, the most common approach is to judge 3 dimensions: simplicity (is the text simpler?), fluency (is the text grammatical?), and adequacy (does the text preserve the meaning?). Looking into the literature, we can define automated metrics which approximate judgements along these dimensions:
- Simplicity: Measure the change in readability score using our readability scoring model for Wikipedia articles.
- Fluency: Count the number of grammatical errors using LanguageTool (see the sketch after this list)
- Adequacy: Measure the factual consistency between the original and simplified text to detect, e.g., “model hallucination”. This is a very active field of research and several methods have been proposed recently such as FactCC (based on a trained classification model), SummaC (based on textual entailment), or QuestEval (based on question generation and answering).
- The advantage of these metrics is that they are more interpretable and that they don't require reference samples from a ground-truth dataset. This will hopefully make it easier to gain confidence about whether models are “good enough” for potential use in production.
- Started to implement metrics so we can automatically measure model performance.
- Next step: Finalize implementation of metrics for evaluating model performance and run on representative sample with existing model.
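For illustration, a minimal sketch of the fluency metric using the language_tool_python wrapper around LanguageTool (the other two metrics would be computed analogously with the readability model and one of the consistency methods above):

```python
import language_tool_python

# LanguageTool flags grammar/style issues; we use the count as a fluency
# proxy (fewer issues = more fluent output).
tool = language_tool_python.LanguageTool("en-US")

def fluency_errors(text: str) -> int:
    return len(tool.check(text))

print(fluency_errors("This are a example sentences."))  # flags agreement errors
```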
Aug 2 2024
Update: I added a draft section about this to the handbook https://office.wikimedia.org/wiki/Research/Handbook/Communication#Communicating_with_public_media
weekly update:
- I put together documentation for article recommendations with different tools from Research for experiments WE.3.1.1 (doc)
- Shared documentation with hypothesis owner for feedback.
weekly update:
- Built two working model prototypes for content simplification/summarization
- 1) Simplification: Generate a simplified version of a section/paragraph using simpler language.
- 2) Section-gist: Generate a plain language summary of a section (i.e. combining simplification and summarization)
- Put together detailed documentation about the two models with examples and tutorial notebooks on how the models can be run (doc). I will add these updates to the project-meta page too in the next week or so.
- Exploring alternatives for automatic evaluation of models to make it easier to iterate through different model variants without the need for manual evaluation.
- Experimenting with the test-deployment of the Aya-23 model (Aya-23-8B), which is used for the section-gist model, in the staging instance of LiftWing (available thanks to the ML Team). The model works and returns requests in a reasonable time, though the output quality seems substantially lower than that of the larger model I used in the experiments with the external API. The ML Team is planning to test-deploy the larger model version (Aya-23-35B) in the next weeks.
Update: Fabian, Isaac, and I coordinated and left detailed feedback and comments in the design doc. Resolving the task as the request in the task description is completed. We will of course continue to engage in any follow-up discussions in the doc.
Jul 29 2024
For the next round of updates, could you add the following items:
- Add to Publications page and Knowledge Gaps publications:
- Mykola Trokhymovych, Indira Sen, Martin Gerlach. 2024. An Open Multilingual System for Scoring Readability of Wikipedia. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL '24).
- Also add blurb to Knowledge Gaps updates:
- Date: Aug 2024
- Title: A multilingual model for measuring readability
- Blurb: We published a new paper at ACL ‘24 where we develop a multilingual model to score the readability of Wikipedia articles across languages.
- Link: https://arxiv.org/abs/2406.01835
Jul 26 2024
weekly update
- Shared results on section-gists of Wikipedia articles (T369288#10018291) with Web team as one potential approach for experiments on summarization and simplification for readers.
weekly update:
- Ran small-scale experiments to automatically generate section-gists (i.e. plain-language summaries of a section) for Wikipedia articles using different models.
- Test-dataset with original and simplified text from only the lead section of 10 articles in English, German, Portuguese (see spreadsheet)
- Sample-dataset of several sections from the same article without the reference-simplification (see spreadsheet)
- We are able to automatically generate section-gists (plain-language summaries of a section) of Wikipedia articles across languages using the LLM Aya-23. Based on a qualitative evaluation of a small sample (<100), the results are promising: the model can provide a concise and easy-to-understand overview of a long piece of text. I successfully tested English, German, and Portuguese, but the model supports 23 languages, supposedly covering nearly half the world’s population in terms of speakers (far more than any other comparable LLM that I am aware of).
- Evaluation of text simplification is challenging. Automated metrics (such as SARI) don't seem very useful for rigorously assessing model performance for specific use cases: simple baselines, such as returning a truncated version of the input text, can yield surprisingly good results (see the sketch after this list). As a result, we should probably not rely on these metrics and instead resort to manually judging (samples of) the model output, which is, however, much more resource-intensive.
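For reference, a sketch of how SARI can be computed with Hugging Face's evaluate library; the texts are made up, and the truncation baseline illustrates why such scores can be misleading:

```python
import evaluate

sari = evaluate.load("sari")
sources = ["About 95 species are currently accepted by taxonomists."]
references = [["About 95 species are accepted."]]

model_output = ["Around 95 species are accepted."]
baseline = [sources[0][:30]]  # naive baseline: truncate the input text

# A truncation baseline can score surprisingly close to a real model output.
print(sari.compute(sources=sources, predictions=model_output, references=references))
print(sari.compute(sources=sources, predictions=baseline, references=references))
```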
Jul 19 2024
weekly update:
- Android reached out to understand more about recommended content within search. I shared some resources from research on search and article recommendations (e.g. from list building)
- Web is starting with first experiments on recommendations in search. Providing support for using the article-similarity search from the list-building tool https://list-building.toolforge.org/
- Web is preparing to start thinking about experiments on simplification, which will happen later in the quarter. Ongoing discussions about what kind of simplification would be useful and how to evaluate it.
weekly summary:
- Clarified the criteria and constraints for the model
- Multilingual: support (some) languages other than English
- Openness: the model needs to be open.
- Resources: We need to be able to host the model in our infrastructure in LiftWing with reasonable inference time.
- Use-case: We need to define the use-case; for example, should simplification be on the level of sentence, paragraph, section, or the full article?
- Quality: Ensure the output is useful in practice according to some metric
- Did a deep-dive on recent works on text simplification with language models (reviewing 26 papers from the past 2-3 years, see below [1]). This helped me understand the most common and promising strategies for approaching the task and identify challenges.
- Some of the main learnings came from a paper (Paper Plain) which aims to improve access to medical papers. Based on interviews with readers about barriers to interacting with content, and on usability testing, they identify section gists as a valuable and frequently used feature among non-expert readers.
- Operationalize simplification as a plain-language summary of a section
- For example, use a prompt: “Summarize content you are provided with for a fifth-grade student.” (see the sketch after this list)
- This approach seems highly effective, as most LLMs are explicitly trained on the task of summarization across languages (e.g. XLSum)
- I also reviewed some of the recent large language models that could be good candidates, and identified the Aya-23 model as a promising candidate:
- Multilingual: it supports 23 languages (these languages cover approximately half the world’s population)
- It is an open-weight model and is available on Hugging Face
- Given the successful test-deployment of the similarly-sized Gemma2-27B-it on LiftWing (T369055), it seems that this model could in principle be also hosted.
- The model can be prompted to generate plain-language summaries of sections without fine-tuning
- Some initial tests with the API-endpoint looked promising across several languages.
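For illustration, a hypothetical sketch of the prompting approach against the Cohere API (where Aya-23 is served); the model name and prompt wording are placeholders, not the exact ones from the experiments:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")
section_text = "..."  # plain text of a Wikipedia section

# Prompt the model to produce a plain-language summary (section gist).
response = co.chat(
    model="c4ai-aya-23",
    message=(
        "Summarize content you are provided with for a fifth-grade student:\n\n"
        + section_text
    ),
)
print(response.text)
```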
Jul 12 2024
weekly update:
- we successfully ran the second part of the pilot survey
- we will start analyzing the data next week. based on the results we will decide on next steps: either continue and scale, or re-assess the general approach. we will post results on the meta-page https://meta.wikimedia.org/wiki/Research:Understanding_perception_of_readability_in_Wikipedia#Pilot_survey:_version_3
- for now I am closing the task as the goal of running the survey was accomplished.
weekly update:
- I reached out to all hypothesis owners in WE.3.1 individually. We also had a joint meeting for Support WE.3.1. From this, I obtained a much clearer picture of the needed support:
- WE.3.1.1: might need light support (consulting) for options to generate recommendations. Substantial support needed for experiments on simplification/summarization. Web Team is starting to think about specifications in more detail. So these are ongoing discussions at the moment.
- WE.3.1.4: No support needed at this point. They will focus on figuring out what work would need to be done for scaling search (e.g. morelike). They also want to start looking into vector search which would likely require some support from Research (e.g. creating vectors/embeddings). However, they are starting from scratch and during Q1 will start to figure out what they want and what support would be needed in the future.
- WE.3.1.5: No support needed at this point. They want to skip the use of orphans for the first round of experiments.
Weekly update:
- Systematizing open questions for successful model development
- Infrastructure requirements: size, performance (inference time)
- Product requirements: Languages, Quality, Use-case
- Model requirements: Openness (can we host), Effectiveness (does the model have a chance to perform well), training (supervised, in-context learning, zero-shot as is), evaluation
- Gathering input from product team (Web) on intended use to tailor model specifications (languages, input/output format, quality boundaries). These are ongoing discussions but should become more clear in the next week or so
- Learning about updates in the ML infrastructure which expand the potential set of candidate models. The ML team announced that we will have new servers with GPUs available for model training and hosting. This will allow us to use larger (and potentially better) models. Similarly, I am following the test-deployment of the Gemma2 model on LiftWing T369055. If successful, this could constitute a promising candidate model for the simplification task (or models of similar size/architecture).
- Reviewing scientific literature to compile a list of candidate models for the task. I have identified more than 10 recent papers (2023/2024) on using LLMs for text simplification. This is very insightful for understanding which models are most promising for the specific task. For example, a promising model is Aya-23, an open model with dedicated multilingual support that is, in principle, compatible with our infrastructure constraints (see above).
Jul 10 2024
Jul 5 2024
weekly update:
- ran the first part of the survey this week. next week will be the second part.
Jul 4 2024
Jul 1 2024
@MGerlach if you have any other observations, please add them to this task. We'll review this as part of the retro and it's good to have as we prepare for next year and know what we need to address in advance. I'll resolve the task after you're done :)
@KinneretG Agree with everything you wrote. In general, though, I did like OpenReview. One additional comment:
- the setting of deadlines was public (e.g. for the submission deadline of reviews by PC members). We often set the deadline in OpenReview to some time (e.g. a few hours or a day) after the officially communicated deadline in order to also accommodate those who run into technical or other issues. However, the internally set deadline is also publicly visible to the relevant folks. This led to some confusion as to what the actual deadline was.
Moving this to the next quarter (FY2024-25-Research-July-September) as the work is not yet fully completed
- the MR for the pipeline in airflow is submitted.
- this needs some code-review by research-engineering before it can be merged. this should be completed by mid-July (see T361929#9935541)
- we expect to resolve the task in the next 1-2 weeks
We successfully ran the research track at Wiki Workshop 2024.
Closing this task as all sub-tasks have been completed.
Moving to FY2024-25-Research-July-September for now as we need 1 more week to actually run the survey.
Jun 28 2024
weekly update:
- finished internal testing
- implemented two changes based on feedback: i) made the survey shorter by removing a few pairs; ii) added a question about how regularly participants use Wikipedia.
- figuring out available resources on prolific
- planning to deploy next week
weekly updates:
- finally I managed to spend some time on this.
- I figured out that one of the main bottlenecks was calculating the indegree for each potential link target (this is crucial since we want to prioritize articles with low indegree, such as orphans). My initial approach was to use the Linkshere API; however, this requires a separate call for each individual article. A much cheaper alternative is to query the replicas, with only a single query for potentially hundreds of articles for which we want the indegree (see the example in Quarry, and the sketch after this list). The replicas can be easily queried from Toolforge (wikitech documentation, example script in PAWS). For an example article, I could reduce the query time 10-fold.
- I integrated this (and a few other fixes to improve the recommendations) into the latest version of the tool. Example: https://linkrec.toolforge.org/readmore?lang=en&title=Tiwanaku
- the example still takes some time, but it's much less likely to simply time out since the number of API calls is much smaller.
- in case we need much more substantial speedups, we might want to resort to other heuristics such as the one mentioned above (T361944#9809372), using morelikethis in cirrussearch combined with the orphans template
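For illustration, a sketch of the batched replica query from Toolforge; the table/column names assume the current pagelinks/linktarget schema and may need adjusting per wiki:

```python
import pymysql

# One query for a whole batch of link targets, instead of one Linkshere API
# call per article.
conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",
    database="enwiki_p",
    read_default_file="~/replica.my.cnf",  # Toolforge credentials
    charset="utf8mb4",
)
titles = ["Tiwanaku", "Pumapunku"]  # hypothetical batch of article titles

query = """
    SELECT lt_title, COUNT(*) AS indegree
    FROM pagelinks
    JOIN linktarget ON pl_target_id = lt_id
    WHERE lt_namespace = 0 AND lt_title IN %s
    GROUP BY lt_title
"""
with conn.cursor() as cur:
    cur.execute(query, (titles,))  # pymysql expands the list into the IN clause
    for title, indegree in cur.fetchall():
        print(title, indegree)
```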
weekly update:
- finalized the multilingual experiments beyond English
- wrote up the results on the meta-page https://meta.wikimedia.org/wiki/Research:Develop_a_model_for_text_simplification_to_improve_readability_of_Wikipedia_articles/First_round_of_experiments#Multilingual_experiments
- The main takeaway is that the model seems to work well for some languages but not for others: "For a few languages scores were similar to English (German, Italian, Catalan); for many languages scores were slightly lower (Spanish, Basque, French, Portuguese, Dutch); and for some languages performance was substantially worse (Greek, Armenian, and Russian)"
Jun 21 2024
weekly updates:
- Ran and evaluated experiments with the fine-tuned Flan-T5 on other languages. The main problem is that if the model is only trained on article pairs (original/simplified) in English, the model output (simplified) will most often be in English, even if the model input (original) is in, say, German or French. Thus, the model performs simplification AND translation into English, since it has only seen English samples of simplified text.
- As a potential solution, I am experimenting with the following setup: i) fine-tune the model with additional samples from the other languages; ii) add a prefix "Simplify in <LANG>: " to the model input. This provides additional instructions to the model, treating simplifications in different languages as different but related tasks (see the sketch below).
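For illustration, a minimal sketch of the prefixing setup for the fine-tuning data; field names and max lengths are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

def preprocess(example):
    # example: {"lang": "German", "original": "...", "simplified": "..."}
    # The language prefix turns simplification per language into related tasks.
    inputs = tokenizer(
        f"Simplify in {example['lang']}: {example['original']}",
        truncation=True, max_length=512,
    )
    labels = tokenizer(
        text_target=example["simplified"], truncation=True, max_length=512
    )
    inputs["labels"] = labels["input_ids"]
    return inputs
```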
weekly update:
- internal testing of the limesurvey (including integration with prolific)
- next week: deployment on prolific
@DDeSouza could you please remove the pdf for the paper (the sooner the better, so it doesn't get indexed by Google Scholar etc.):
"Structural Evolution of Co-Creation in Wikipedia [PDF]
Negin Maddah and Babak Heydari"
They opted out of having the pdf put on the website.
Thank you
Jun 14 2024
weekly update:
- no updates
weekly update:
- finalized the Limesurvey
- will test the Limesurvey internally next week (after that can be deployed)
weekly update:
- started setting up the experiments for evaluating fine-tuned Flan-T5 on other languages
- will run experiments next week
weekly update:
- finalized the program for the research track (sessions, session chairs) https://docs.google.com/spreadsheets/d/1KXCIitFfd57bRwuL30jciJEhipq4-wMI_hzvW0JmBCo/edit?gid=0#gid=0
- wrote a doc with instructions for session chairs
Jun 10 2024
May 30 2024
weekly update:
- no update. main focus was preparing for attending ICWSM T362416
weekly update:
- no update. main focus was preparing for attending ICWSM T362416
weekly update:
- no update. main focus was preparing for attending ICWSM T362416
weekly update:
- iterating on the sessions and the session chairs
- planning to finalize next week
May 27 2024
Final update:
- sent notifications to authors
- the website will be updated with the accepted extended abstracts and the members of the program committee
- also sent a thank-you email to reviewers, inviting them to register to attend the event
May 24 2024
weekly update:
- no update since my main focus was on Wiki Workshop T352543
weekly update:
- no update since my main focus was on Wiki Workshop T352543
weekly update:
- working on finalizing the limesurvey with manually verified samples
weekly update:
- finalized session format and alignment with workshop schedule
- created first draft of organizing papers into sessions
- created shortlist of session chairs
weekly update:
- made decisions about all submissions
- will be sending out notifications to authors next week (May 27)