This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.
Quick legend on available resource types:
⭐ - open source project, usually a GitHub repository with its number of stars
📙 - resource you can read, usually a blog post or a paper
🗂️ - a collection of additional resources
🔱 - non-open source tool, framework or paid service
🎥️ - a resource you can watch
🎙️ - a resource you can listen to
Section keywords: paper summaries, compendium, awesome list
- 🗂️ The NLP Index - Searchable Index of NLP Papers by Quantum Stat / NLP Cypher
- ⭐ Awesome NLP by keon [GitHub, 16528 stars]
- ⭐ Speech and Natural Language Processing Awesome List by elaboshira [GitHub, 2189 stars]
- ⭐ Awesome Deep Learning for Natural Language Processing (NLP) [GitHub, 1274 stars]
- ⭐ Text Mining and Natural Language Processing Resources by stepthom [GitHub, 557 stars]
- 🗂️ Brainsources for #NLP enthusiasts by Philip Vollet
- ⭐ Awesome AI/ML/DL - NLP Section [GitHub, 1473 stars]
- 🗂️ NLP articles by Devopedia
- ⭐ 100 Must-Read NLP Papers [GitHub, 3732 stars]
- ⭐ NLP Paper Summaries by dair-ai [GitHub, 1475 stars]
- ⭐ Curated collection of papers for the NLP practitioner [GitHub, 1075 stars]
- ⭐ Papers on Textual Adversarial Attack and Defense [GitHub, 1501 stars]
- ⭐ Recent Deep Learning papers in NLU and RL by Valentin Malykh [GitHub, 296 stars]
- ⭐ A Survey of Surveys (NLP & ML): Collection of NLP Survey Papers [GitHub, 1997 stars]
- ⭐ A Paper List for Style Transfer in Text [GitHub, 1609 stars]
- 🎥 Video recordings index for papers
- ⭐ NLP top 10 conferences Compendium by soulbliss [GitHub, 459 stars]
- 📙 ICLR 2020 Trends
- 📙 SpacyIRL 2019 Conference in Overview
- 📙 Paper Digest - Conferences and Papers in Overview
- ⭐ NLP Progress by sebastianruder [GitHub, 22568 stars]
- ⭐ NLP Tasks by Kyubyong [GitHub, 3017 stars]
- ⭐ NLP Datasets by niderhoff [GitHub, 5741 stars]
- ⭐ Datasets by Huggingface [GitHub, 19096 stars]
- 🗂️ Big Bad NLP Database
- ⭐ UWA Unambiguous Word Annotations - Word Sense Disambiguation Dataset
- ⭐ MLDoc - Corpus for Multilingual Document Classification in Eight Languages [GitHub, 152 stars]
- ⭐ Awesome Embedding Models by Hironsan [GitHub, 1752 stars]
- ⭐ Awesome list of Sentence Embeddings by Separius [GitHub, 2219 stars]
- ⭐ Awesome BERT by Jiakui [GitHub, 1846 stars]
- ⭐ The Super Duper NLP Repo [Website, 2020]
- ⭐ NLP Resources for Bahasa Indonesian [GitHub, 480 stars]
- ⭐ Indic NLP Catalog [GitHub, 552 stars]
- ⭐ Pre-trained language models for Vietnamese [GitHub, 653 stars]
- ⭐ Natural Language Toolkit for Indic Languages (iNLTK) [GitHub, 814 stars]
- ⭐ Indic NLP Library [GitHub, 550 stars]
- ⭐ AI4Bharat-IndicNLP Portal
- ⭐ ARBML - Implementation of many Arabic NLP and ML projects [GitHub, 387 stars]
- ⭐ zemberek-nlp - NLP tools for Turkish [GitHub, 1146 stars]
- ⭐ TDD AI - An open-source platform for all Turkish datasets, language models, and NLP tools.
- ⭐ KLUE - Korean Language Understanding Evaluation [GitHub, 560 stars]
- ⭐ Persian NLP Benchmark - benchmark for evaluation and comparison of various NLP tasks in Persian language [GitHub, 73 stars]
- ⭐ nlp-greek - Greek language sources [GitHub, 5 stars]
- ⭐ Awesome NLP Resources for Hungarian [GitHub, 221 stars]
- ⭐ List of pre-trained NLP models [GitHub, 170 stars]
- ⭐ Pretrained language models developed by Huawei Noah's Ark Lab [GitHub, 3019 stars]
- ⭐ Spanish Language Models and resources [GitHub, 251 stars]
- ⭐ Modern Deep Learning Techniques Applied to Natural Language Processing [GitHub, 1328 stars]
- 📙 A Review of the Neural History of Natural Language Processing [Blog, October 2018]
- 📙 Natural Language Processing in 2020: The Year In Review [Blog, December 2020]
- 📙 ML and NLP Research Highlights of 2020 [Blog, January 2021]
🔙 Back to the Table of Contents
- 🎙️ NLP Highlights [Years: 2017 - now, Status: active]
- 🎙️ The NLP Zone Episodes [Years: 2021 - now, Status: active]
- 🎙️ TWIML AI [Years: 2016 - now, Status: active]
- 🎙️ Practical AI [Years: 2018 - now, Status: active]
- 🎙️ The Data Exchange [Years: 2019 - now, Status: active]
- 🎙️ Gradient Dissent [Years: 2020 - now, Status: active]
- 🎙️ Machine Learning Street Talk [Years: 2020 - now, Status: active]
- 🎙️ DataFramed - latest trends and insights on how to scale the impact of data science in organizations [Years: 2019 - now, Status: active]
- 🎙️ The Super Data Science Podcast [Years: 2016 - now, Status: active]
- 🎙️ Data Hack Radio [Years: 2018 - now, Status: active]
- 🎙️ AI Game Changers [Years: 2020, Status: active]
- 🎙️ The Analytics Show [Years: 2019 - now, Status: active]
- 📙 NLP News by Sebastian Ruder
- 📙 This Week in NLP by Robert Dale
- 📙 Papers with Code
- 📙 The Batch by deeplearning.ai
- 📙 Paper Digest by PaperDigest
- 📙 NLP Cypher by QuantumStat
- 🎥 NLP Zurich [YouTube Recordings]
- 🎥 Hacking-Machine-Learning [YouTube Recordings]
- 🎥 NY-NLP (New York)
- 🎥 Yannic Kilcher
- 🎥 HuggingFace
- 🎥 Kaggle Reading Group
- 🎥 Rasa Paper Reading
- 🎥 Stanford CS224N: NLP with Deep Learning
- 🎥 NLPxing
- 🎥 ML Explained - A.I. Socratic Circles - AISC
- 🎥 Deeplearning.ai
- 🎥 Machine Learning Street Talk
🔙 Back to the Table of Contents
- ⭐ GLUE - General Language Understanding Evaluation (GLUE) benchmark
- ⭐ SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
- ⭐ decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
- ⭐ dialoglue - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue [GitHub, 280 stars]
- ⭐ DynaBench - Dynabench is a research platform for dynamic data collection and benchmarking
- ⭐ Big-Bench - collaborative benchmark for measuring and extrapolating the capabilities of language models [GitHub, 2835 stars]
- ⭐ WikiAsp - WikiAsp: Multi-document aspect-based summarization Dataset
- ⭐ WikiLingua - A Multilingual Abstractive Summarization Dataset
- ⭐ SQuAD - Stanford Question Answering Dataset (SQuAD)
- ⭐ XQuad - XQuAD (Cross-lingual Question Answering Dataset) for cross-lingual question answering
- ⭐ GrailQA - Strongly Generalizable Question Answering (GrailQA)
- ⭐ CSQA - Complex Sequential Question Answering
- 📙 XTREME - Massively Multilingual Multi-task Benchmark
- ⭐ GLUECoS - A benchmark for code-switched NLP
- ⭐ IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
- ⭐ LinCE - Linguistic Code-Switching Evaluation Benchmark
- ⭐ Russian SuperGlue - Russian SuperGlue Benchmark
- ⭐ BLURB - Biomedical Language Understanding and Reasoning Benchmark
- ⭐ BLUE - Biomedical Language Understanding Evaluation benchmark
- ⭐ LexGLUE - A Benchmark Dataset for Legal Language Understanding in English
- ⭐ Long-Range Arena - Long Range Arena for Benchmarking Efficient Transformers (Pre-print) [GitHub, 716 stars]
- ⭐ SUPERB - Speech processing Universal PERformance Benchmark
- ⭐ CodeXGLUE - A benchmark dataset for code intelligence
- ⭐ CrossNER - CrossNER: Evaluating Cross-Domain Named Entity Recognition
- ⭐ MultiNLI - Multi-Genre Natural Language Inference corpus
- ⭐ iSarcasm: A Dataset of Intended Sarcasm - iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic
🔙 Back to the Table of Contents
- 📙 A Recipe for Training Neural Networks by Andrej Karpathy [Keywords: research, training, 2019]
- 📙 Recent Advances in NLP via Large Pre-Trained Language Models: A Survey [Paper, November 2021]
- ⭐ Pre-trained ELMo Representations for Many Languages [GitHub, 1458 stars]
- ⭐ sense2vec - Contextually-keyed word vectors [GitHub, 1617 stars]
- ⭐ wikipedia2vec [GitHub, 935 stars]
- ⭐ StarSpace [GitHub, 3938 stars]
- ⭐ fastText [GitHub, 25871 stars]
- 📙 Language Models and Contextualised Word Embeddings by David S. Batista [Blog, 2018]
- 📙 An Essential Guide to Pretrained Word Embeddings for NLP Practitioners by AnalyticsVidhya [Blog, 2020]
- 📙 Polyglot Word Embeddings Discover Language Clusters [Blog, 2020]
- 📙 The Illustrated Word2vec by Jay Alammar [Blog, 2019]
- ⭐ vecmap - VecMap (cross-lingual word embedding mappings) [GitHub, 644 stars]
- ⭐ sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 14981 stars]
- ⭐ bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 1179 stars]
- ⭐ subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 2185 stars]
- ⭐ python-bpe - Byte Pair Encoding for Python [GitHub, 223 stars]
- 📙 The Transformer Family by Lilian Weng [Blog, 2020]
- 📙 Playing the lottery with rewards and multiple languages - about the effect of random initialization [ICLR 2020 Paper]
- 📙 Attention? Attention! by Lilian Weng [Blog, 2018]
- 📙 the transformer … “explained”? [Blog, 2019]
- 🎥️ Attention is all you need; Attentional Neural Network Models by Łukasz Kaiser [Talk, 2017]
- 📙 Attention Is Off By One [July, 2023]
- 🎥️ Understanding and Applying Self-Attention for NLP [Talk, 2018]
- 📙 The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures [Paper, April 2021]
- 📙 Pre-Trained Models: Past, Present and Future [Paper, June 2021]
- 📙 A Survey of Transformers [Paper, June 2021]
- 📙 The Annotated Transformer by Harvard NLP [Blog, 2018]
- 📙 The Illustrated Transformer by Jay Alammar [Blog, 2018]
- 📙 Illustrated Guide to Transformers by Hong Jing [Blog, 2020]
- 📙 Sequential Transformer with Adaptive Attention Span by Facebook [Blog, 2019]
- 📙 Evolution of Representations in the Transformer by Lena Voita [Blog, 2019]
- 📙 Reformer: The Efficient Transformer [Blog, 2020]
- 📙 Longformer — The Long-Document Transformer by Viktor Karlsson [Blog, 2020]
- 📙 TRANSFORMERS FROM SCRATCH [Blog, 2019]
- 📙 Transformers in Natural Language Processing — A Brief Survey by George Ho [Blog, May 2020]
- ⭐ Lite Transformer - Lite Transformer with Long-Short Range Attention [GitHub, 596 stars]
- 📙 Transformers from Scratch [Blog, Oct 2021]
- 📙 A Visual Guide to Using BERT for the First Time by Jay Alammar [Blog, 2019]
- 📙 The Dark Secrets of BERT by Anna Rogers [Blog, 2020]
- 📙 Understanding searches better than ever before [Blog, 2019]
- 📙 Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework [Blog, 2019]
- ⭐ SemBERT - Semantics-aware BERT for Language Understanding [GitHub, 286 stars]
- ⭐ BERTweet - BERTweet: A pre-trained language model for English Tweets [GitHub, 574 stars]
- ⭐ Optimal Subarchitecture Extraction for BERT [GitHub, 470 stars]
- ⭐ CharacterBERT: Reconciling ELMo and BERT [GitHub, 195 stars]
- 📙 When BERT Plays The Lottery, All Tickets Are Winning [Blog, Dec 2020]
- ⭐ BERT-related Papers - a list of BERT-related papers [GitHub, 2032 stars]
- 📙 T5 Understanding Transformer-Based Self-Supervised Architectures [Blog, August 2020]
- 📙 T5: the Text-To-Text Transfer Transformer [Blog, 2020]
- ⭐ multilingual-t5 - Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model [GitHub, 1245 stars]
- 📙 Big Bird: Transformers for Longer Sequences original paper by Google Research [Paper, July 2020]
- 🎥️ Reformer: The Efficient Transformer - [Paper, February 2020] [Video, October 2020]
- 🎥️ Longformer: The Long-Document Transformer - [Paper, April 2020] [Video, April 2020]
- 🎥️ Linformer: Self-Attention with Linear Complexity - [Paper, June 2020] [Video, June 2020]
- 🎥️ Rethinking Attention with Performers - [Paper, September 2020] [Video, September 2020]
- ⭐ performer-pytorch - An implementation of Performer, a linear attention-based transformer, in Pytorch [GitHub, 1084 stars]
- 📙 Switch Transformers: Scaling to Trillion Parameter Models original paper by Google Research [Paper, January 2021]
- 📙 The Illustrated GPT-2 by Jay Alammar [Blog, 2019]
- 📙 The Annotated GPT-2 by Aman Arora
- 📙 OpenAI’s GPT-2: the model, the hype, and the controversy by Ryan Lowe [Blog, 2019]
- 📙 How to generate text by Patrick von Platen [Blog, 2020]
- 📙 Zero Shot Learning for Text Classification by Amit Chaudhary [Blog, 2020]
- 📙 GPT-3 A Brief Summary by Leo Gao [Blog, 2020]
- 📙 GPT-3, a Giant Step for Deep Learning And NLP by Yoel Zeldes [Blog, June 2020]
- 📙 GPT-3 Language Model: A Technical Overview by Chuan Li [Blog, June 2020]
- 📙 Is it possible for language models to achieve language understanding? by Christopher Potts
- ⭐ Awesome GPT-3 - list of all resources related to GPT-3 [GitHub, 4589 stars]
- 🗂️ GPT-3 Projects - a map of all GPT-3 start-ups and commercial projects
- 🗂️ GPT-3 Demo Showcase - GPT-3 Demo Showcase, 180+ Apps, Examples, & Resources
- 🔱 OpenAI API - API Demo to use OpenAI GPT for commercial applications
- 📙 GPT-Neo - in-progress GPT-3 open source replication HuggingFace Hub
- ⭐ GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile
- 📙 Effectively using GPT-J with few-shot learning [Blog, July 2021]
- 📙 What is Two-Stream Self-Attention in XLNet by Xu LIANG [Blog, 2019]
- 📙 Visual Paper Summary: ALBERT (A Lite BERT) by Amit Chaudhary [Blog, 2020]
- 📙 Turing NLG by Microsoft
- 📙 Multi-Label Text Classification with XLNet by Josh Xin Jie Lee [Blog, 2019]
- ⭐ ELECTRA [GitHub, 2326 stars]
- ⭐ Performer - implementation of Performer, a linear attention-based transformer, in Pytorch [GitHub, 1084 stars]
- 📙 Distilling knowledge from Neural Networks to build smaller and faster models by FloydHub [Blog, 2019]
- 📙 Compression of Deep Learning Models for Text: A Survey [Paper, April 2021]
- ⭐ Bert-squeeze - code to reduce the size of Transformer-based models or decrease their latency at inference time [GitHub, 79 stars]
- ⭐ XtremeDistil - XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks [GitHub, 153 stars]
- 📙 PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization by Google AI [Blog, June 2020]
- ⭐ CTRLsum - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 146 stars]
- ⭐ XL-Sum - XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [GitHub, 252 stars]
- ⭐ SummerTime - an open-source text summarization toolkit for non-experts [GitHub, 265 stars]
- ⭐ PRIMER - PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization [GitHub, 151 stars]
- ⭐ summarus - Models for automatic abstractive summarization [GitHub, 170 stars]
- 📙 Fusing Knowledge into Language Model [Presentation, Oct 2021]
Section keywords: best practices, MLOps
🔙 Back to the Table of Contents
- 🎥 In Search of Best Practices for NLP Projects [Slides, Dec. 2020]
- 🎥 EMNLP 2020: High Performance Natural Language Processing by Google Research [Recording, Nov. 2020]
- 📙 Practical Natural Language Processing - A Comprehensive Guide to Building Real-World NLP Systems [Book, June 2020]
- 📙 How to Structure and Manage NLP Projects [Blog, May 2021]
- 📙 Applied NLP Thinking - Applied NLP Thinking: How to Translate Problems into Solutions [Blog, June 2021]
- 🎥 Introduction to NLP for Industry Use - DataTalksClub presentation on Introduction to NLP for Industry Use [Recording, December 2021]
- 📙 Measuring Embedding Drift - Best practices for monitoring drift of NLP models [Blog, December 2022]
MLOps, especially when applied to NLP, is a set of best practices around automating various parts of the workflow when building and deploying NLP pipelines.
In general, MLOps for NLP includes having the following processes in place:
- Data Versioning - make sure your training, annotation and other types of data are versioned and tracked
- Experiment Tracking - make sure that all of your experiments are automatically tracked and saved where they can be easily replicated or retraced
- Model Registry - make sure any neural models you train are versioned and tracked, so that it is easy to roll back to any of them
- Automated Testing and Behavioral Testing - besides regular unit and integration tests, you want to have behavioral tests that check for bias or potential adversarial attacks
- Model Deployment and Serving - automate model deployment, ideally with zero-downtime strategies such as blue/green or canary deploys
- Data and Model Observability - track data drift, model accuracy drift etc.
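As a rough illustration of the behavioral testing item above, here is a minimal sketch of an invariance test, loosely in the spirit of tools like CheckList. The `predict` function is a hypothetical keyword-counting stand-in for a real sentiment model, and the template, names, and helper `invariance_failures` are assumptions made for this example only.

```python
import re

# Hypothetical stand-in for a real sentiment model: it only counts cue words.
POSITIVE = {"great", "good", "excellent"}
NEGATIVE = {"bad", "awful", "dull"}


def predict(text):
    """Return "pos" or "neg" based on a simple cue-word count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "pos" if score >= 0 else "neg"


def invariance_failures(model, template, fillers):
    """Return the fillers whose prediction differs from the first filler's."""
    baseline = model(template.format(name=fillers[0]))
    return [f for f in fillers[1:] if model(template.format(name=f)) != baseline]


# Swapping the person's name should not change the predicted sentiment.
names = ["Anna", "Mehmet", "Wei", "Priya"]
failures = invariance_failures(predict, "{name} said the film was great.", names)
print(failures)  # an empty list means the prediction did not depend on the name
```

A check like this can sit next to regular unit tests, so that a model whose predictions change with the name fails the build before it is deployed.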
Additionally, there are two more components that are not as prevalent for NLP and are mostly used for Computer Vision and other sub-fields of AI:
- Feature Store - centralized storage of all features developed for ML models that can be easily reused by any other ML project
- Metadata Management - storage for all information related to the usage of ML models, mainly for reproducing behavior of deployed ML models, artifact tracking etc.
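The versioning, tracking, and metadata items above can be sketched with the Python standard library alone. This is a toy illustration, not how DVC, mlflow, or any other listed tool works internally: the record schema, the helper names `dataset_version`, `log_experiment`, and `best_run`, and the metric values are all made up for the example.

```python
import hashlib
import json
import time


def dataset_version(records):
    """Return a short content hash that changes whenever the data changes."""
    payload = json.dumps(records, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def log_experiment(log, params, metrics, data_version):
    """Append one run record (hypothetical schema) to an in-memory log."""
    run = {
        "run_id": len(log) + 1,
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
        "data_version": data_version,  # ties the run to the exact data it saw
    }
    log.append(run)
    return run


def best_run(log, metric):
    """Pick the run with the highest value for `metric`, e.g. as a rollback target."""
    return max(log, key=lambda run: run["metrics"][metric])


# Toy data and toy metric values, for illustration only.
train_data = [
    {"text": "great movie", "label": "pos"},
    {"text": "dull plot", "label": "neg"},
]
version = dataset_version(train_data)

runs = []
log_experiment(runs, {"lr": 1e-3}, {"f1": 0.81}, version)
log_experiment(runs, {"lr": 1e-4}, {"f1": 0.86}, version)
print(best_run(runs, "f1")["params"])
```

Storing the data hash inside every run record is what makes a past result reproducible: the parameters alone are not enough if the training data has changed since.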
- ⭐ awesome-mlops [GitHub, 12526 stars]
- ⭐ best-of-ml-python [GitHub, 16309 stars]
- 🗂️ MLOps.Toys - a curated list of MLOps projects
- 📙 Machine Learning Operations (MLOps): Overview, Definition, and Architecture [Paper, May 2022]
- 📙 Requirements and Reference Architecture for MLOps: Insights from Industry [Paper, Oct 2022]
- 📙 MLOps: What It Is, Why it Matters, and How To Implement It by Neptune AI [Blog, July 2021]
- 📙 Best MLOps Tools You Need to Know as a Data Scientist by Neptune AI [Blog, July 2021]
- 📙 State of MLOps 2021 by Valohai [Blog, August 2021]
- 📙 The MLOps Stack by Valohai [Blog, October 2020]
- 📙 Data Version Control for Machine Learning Applications by Megagon AI [Blog, July 2021]
- 📙 The Rapid Evolution of the Canonical Stack for Machine Learning [Blog, July 2021]
- 📙 MLOps: Comprehensive Beginner’s Guide [Blog, March 2021]
- 📙 What I’ve learned about MLOps from speaking with 100+ ML practitioners [Blog, May 2021]
- 📙 DataRobot Challenger Models - MLOps Champion/Challenger Models
- 📙 State of MLOps Blog by Dr. Ori Cohen
- 📙 MLOps Ecosystem Overview [Blog, 2021]
- 🗂 MLOps course by Made With ML
- 🗂 GitHub MLOps - collection of resources on how to facilitate Machine Learning Ops with GitHub
- 🗂 ML Observability Fundamentals Course - learn how to monitor and root-cause issues with production NLP models
- The MLOps Community - blogs, Slack group, newsletter and more, all about MLOps
- ⭐ DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
- 🔱 Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
- 🔱 Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
- ⭐ mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- 🔱 Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
- 🔱 Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
- 🔱 Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
- 🔱 SigOpt - automate training & tuning, visualize & compare runs [Paid Service]
- ⭐ Optuna - hyperparameter optimization framework [GitHub, 10650 stars]
- ⭐ Clear ML - experiment, orchestrate, deploy, and build data stores, all in one place [Free and Open Source] Link to GitHub
- ⭐ Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
- ⭐ DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
- ⭐ mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- ⭐ ModelDB - open-source system for Machine Learning model versioning, metadata, and experiment management [GitHub, 1696 stars]
- 🔱 Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
- 🔱 Valohai - End-to-end ML pipelines [Paid Service]
- 🔱 Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
- 🔱 polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
- 🔱 Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
- ⭐ CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 2003 stars]
- ⭐ TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
- ⭐ WildNLP - Corrupt an input text to test NLP models' robustness [GitHub, 76 stars]
- ⭐ Great Expectations - Write tests for your data [GitHub, 9874 stars]
- ⭐ Deepchecks - Python package for comprehensively validating your machine learning models and data [GitHub, 3582 stars]
- ⭐ mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- 🔱 Amazon SageMaker [Paid Service]
- 🔱 Valohai - End-to-end ML pipelines [Paid Service]
- 🔱 NLP Cloud - Production-ready NLP API [Paid Service]
- 🔱 Saturn Cloud [Paid Service]
- 🔱 SELDON - machine learning deployment for enterprise [Paid Service]
- 🔱 Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
- 🔱 polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
- ⭐ TorchServe - flexible and easy to use tool for serving PyTorch models [GitHub, 4174 stars]
- ⭐ Kubeflow - The Machine Learning Toolkit for Kubernetes [GitHub, 10600 stars]
- ⭐ KFServing - Serverless Inferencing on Kubernetes [GitHub, 3504 stars]
- 🔱 TFX - TensorFlow Extended - end-to-end platform for deploying production ML pipelines [Paid Service]
- 🔱 Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
- 🔱 Cortex - containers as a service on AWS [Paid Service]
- 🔱 Azure Machine Learning - end-to-end machine learning lifecycle [Paid Service]
- ⭐ End2End Serverless Transformers On AWS Lambda [GitHub, 121 stars]
- ⭐ NLP-Service - sample demo of NLP as a service platform built using FastAPI and Hugging Face [GitHub, 13 stars]
- ⭐ Dagster - data orchestrator for machine learning [Free and Open Source]
- 🔱 Verta - AI and machine learning deployment and operations [Paid Service]
- ⭐ Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
- ⭐ flyte - workflow automation platform for complex, mission-critical data and ML processes at scale [GitHub, 5525 stars]
- ⭐ MLRun - Machine Learning automation and tracking [GitHub, 1425 stars]
- 🔱 DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
- ⭐ imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 1375 stars]
- ⭐ Cockpit - A Practical Debugging Tool for Training Deep Neural Networks [GitHub, 474 stars]
- ⭐ WeightWatcher - WeightWatcher tool for predicting the accuracy of Deep Neural Networks [GitHub, 1453 stars]
- ⭐ Arize AI - embedding drift monitoring for NLP models
- ⭐ Arize-Phoenix - ML observability for LLMs, vision, language, and tabular models
- ⭐ whylogs - open source standard for data and ML logging [GitHub, 2636 stars]
- ⭐ Rubrix - open-source tool for exploring and iterating on data for artificial intelligence projects [GitHub, 3843 stars]
- ⭐ MLRun - Machine Learning automation and tracking [GitHub, 1425 stars]
- 🔱 DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
- 🔱 Cortex - containers as a service on AWS [Paid Service]
- 🔱 Algorithmia - minimize risk with advanced reporting and enterprise-grade security and governance across all data, models, and infrastructure [Paid Service]
- 🔱 Dataiku - dataiku is for teams who want to deliver advanced analytics using the latest techniques at big data scale [Paid Service]
- ⭐ Evidently AI - tools to analyze and monitor machine learning models [Free and Open Source] Link to GitHub
- 🔱 Fiddler - ML Model Performance Management Tool [Paid Service]
- 🔱 Hydrosphere - open-source platform for managing ML models [Paid Service]
- 🔱 Verta - AI and machine learning deployment and operations [Paid Service]
- 🔱 Domino Model Ops - Deploy and Manage Models to Drive Business Impact [Paid Service]
- 🔱 Datafold - data quality through diffs, profiling, and anomaly detection [Paid Service]
- 🔱 acceldata - improve reliability, accelerate scale, and reduce costs across all data pipelines [Paid Service]
- 🔱 Bigeye - monitoring and alerting to your datasets in minutes [Paid Service]
- 🔱 datakin - end-to-end, real-time data lineage solution [Paid Service]
- 🔱 Monte Carlo - data integrity, drifts, schema, lineage [Paid Service]
- 🔱 SODA - data monitoring, testing and validation [Paid Service]
- 🔱 Tecton - enterprise feature store for machine learning [Paid Service]
- ⭐ FEAST - open source feature store for machine learning Website [GitHub, 5525 stars]
- 🔱 Hopsworks Feature Store - data management system for managing machine learning features [Paid Service]
- ⭐ ML Metadata - a library for recording and retrieving metadata associated with ML developer and data scientist workflows [GitHub, 617 stars]
- 🔱 Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
- ⭐ Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
- ⭐ kedro - Python framework for creating reproducible, maintainable and modular data science code [GitHub, 9883 stars]
- ⭐ Seldon Core - MLOps framework to package, deploy, monitor and manage thousands of production machine learning models [GitHub, 4353 stars]
- ⭐ ZenML - MLOps framework to create reproducible ML pipelines for production machine learning [GitHub, 3972 stars]
- 🔱 Google Vertex AI - build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform [Paid Service]
- ⭐ Diffgram - Complete training data platform for machine learning delivered as a single application [GitHub, 1834 stars]
- 🔱 Continual.ai - build, deploy, and operationalize ML models easier and faster with a declarative interface on cloud data warehouses like Snowflake, BigQuery, RedShift, and Databricks. [Paid Service]
🔙 Back to the Table of Contents
- 📙 Why BERT Fails in Commercial Environments by Intel AI [Blog, 2020]
- 📙 Fine Tuning BERT for Text Classification with FARM by Sebastian Guggisberg [Blog, 2020]
- ⭐ Pretrain Transformers Models in PyTorch using Hugging Face Transformers [GitHub, 254 stars]
- 🎥️ Practical NLP for the Real World [Presentation, 2019]
- 🎥️ From Paper to Product – How we implemented BERT by Christoph Henkelmann [Talk, 2020]
- ⭐ Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 776 stars]
- ⭐ Training BERT with Compute/Time (Academic) Budget [GitHub, 309 stars]
- ⭐ embedding-as-service [GitHub, 204 stars]
- ⭐ Bert-as-service [GitHub, 12399 stars]
- ⭐ NLP Recipes by microsoft [GitHub, 6367 stars]
- ⭐ NLP with Python by susanli2016 [GitHub, 2721 stars]
- ⭐ Basic Utilities for PyTorch NLP by PetrochukM [GitHub, 2210 stars]
- ⭐ Blackstone - A spaCy pipeline and model for NLP on unstructured legal text [GitHub, 636 stars]
- ⭐ Sci spaCy - spaCy pipeline and models for scientific/biomedical documents [GitHub, 1688 stars]
- ⭐ FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks [GitHub, 197 stars]
- ⭐ LexNLP - Information retrieval and extraction for real, unstructured legal text [GitHub, 692 stars]
- ⭐ NerDL and NerCRF - Tutorial on Named Entity Recognition for Healthcare with SparkNLP
- ⭐ Legal Text Analytics - A list of selected resources dedicated to Legal Text Analytics [GitHub, 613 stars]
- ⭐ BioIE - A curated list of resources relevant to doing Biomedical Information Extraction [GitHub, 338 stars]
Section keywords: speech recognition
🔙 Back to the Table of Contents
- ⭐ wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6370 stars]
- ⭐ DeepSpeech - Baidu's DeepSpeech architecture [GitHub, 25166 stars]
- 📙 Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
- ⭐ kaldi - Kaldi is a toolkit for speech recognition [GitHub, 14177 stars]
- ⭐ awesome-kaldi - resources for using Kaldi [GitHub, 532 stars]
- ⭐ ESPnet - End-to-End Speech Processing Toolkit [GitHub, 8355 stars]
- 📙 HuBERT - Self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021]
- ⭐ FastSpeech - The Implementation of FastSpeech based on pytorch [GitHub, 857 stars]
- ⭐ TTS - a deep learning toolkit for Text-to-Speech [GitHub, 34356 stars]
- 🔱 NotebookLM - Google Gemini powered personal assistant / podcast generator
- ⭐ whisper - Robust Speech Recognition via Large-Scale Weak Supervision, by OpenAI [GitHub, 68884 stars]
- ⭐ vibe - GUI tool to work with whisper, multilingual and cuda support included [GitHub, 931 stars]
- ⭐ VoxPopuli - large-scale multilingual speech corpus for representation learning [GitHub, 507 stars]
Section keywords: topic modeling
🔙 Back to the Table of Contents
- 📙 Topic Modelling with PySpark and Spark NLP by Maria Obedkova [Spark, Blog, 2020]
- 📙 A Unique Approach to Short Text Clustering (Algorithmic Theory) by Brittany Bowers [Blog, 2020]
- ⭐ Top2Vec [GitHub, 2924 stars]
- ⭐ Anchored Correlation Explanation Topic Modeling [GitHub, 303 stars]
- ⭐ Topic Modeling in Embedding Spaces [GitHub, 540 stars] Paper
- ⭐ TopicNet - A high-level interface for BigARTM library [GitHub, 140 stars]
- ⭐ BERTopic - Leveraging BERT and a class-based TF-IDF to create easily interpretable topics [GitHub, 6038 stars]
- ⭐ OCTIS - A python package to optimize and evaluate topic models [GitHub, 718 stars]
- ⭐ Contextualized Topic Models [GitHub, 1196 stars]
- ⭐ GSDMM - GSDMM: Short text clustering [GitHub, 353 stars]
Section keywords: keyword extraction
🔙 Back to the Table of Contents
- ⭐ PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 2132 stars]
- ⭐ textrank - TextRank implementation for Python 3 [GitHub, 1248 stars]
- ⭐ rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 1061 stars]
- ⭐ yake - Single-document unsupervised keyword extraction [GitHub, 1632 stars]
- ⭐ RAKE-tutorial - A python implementation of the Rapid Automatic Keyword Extraction [GitHub, 375 stars]
- ⭐ flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub, 5583 stars]
- ⭐ BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub, 254 stars]
- ⭐ keyBERT - Minimal keyword extraction with BERT [GitHub, 3471 stars]
- ⭐ KeyphraseVectorizers - vectorizers that extract keyphrases with part-of-speech patterns [GitHub, 251 stars]
- 📙 Adding a custom tokenizer to spaCy and extracting keywords from Chinese texts by Haowen Jiang [Blog, Feb 2021]
- 📙 How to Extract Relevant Keywords with KeyBERT [Blog, June 2021]