trafilatura · GitHub Topics · GitHub
trafilatura

Here are 8 public repositories matching this topic...

🤖 Automated Q&A Dataset Generation Pipeline powered by LLMs. Multi-stage pipeline that searches, filters, extracts and transforms web content into high-quality question-answer datasets for LLM training. Supports multiple LLM providers (Groq, Mistral, Ollama) and search engines.

  • Updated Jun 7, 2025
  • Python

🤖 Collection of AI agents for web search, RAG, and multi-agent collaboration. Features phi-agent + Groq integration, Ollama support, DuckDuckGo/Google search, web scraping, and local knowledge base querying with vector embeddings.

  • Updated Jun 7, 2025
  • Python

This project is a Python-based web scraping tool that uses the Trafilatura library to extract and save text content from a list of specified websites. The program is designed to process multiple URLs, extract their main content, and save each website's content to a separate .txt file.

  • Updated Nov 1, 2024
  • Jupyter Notebook
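
The workflow this last repository describes (take a list of URLs, extract each page's main content, write one .txt file per site) can be sketched roughly as below. This is not the repository's actual code, only a minimal illustration: `trafilatura.fetch_url` and `trafilatura.extract` are the library's documented entry points, while `url_to_filename`, `extract_main_text`, `save_sites`, the `output` directory, and the file-naming scheme are all hypothetical names chosen for this example.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Turn a URL into a filesystem-safe .txt name (hypothetical naming scheme)."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path  # the query string is ignored
    slug = re.sub(r"[^A-Za-z0-9]+", "_", raw).strip("_")
    return (slug or "page") + ".txt"


def extract_main_text(url: str):
    """Download a page and return its main text, or None on failure."""
    import trafilatura  # third-party dependency: pip install trafilatura

    downloaded = trafilatura.fetch_url(url)  # None if the download fails
    if downloaded is None:
        return None
    return trafilatura.extract(downloaded)  # None if no main content is found


def save_sites(urls, out_dir="output", extractor=extract_main_text):
    """Write each URL's extracted text to its own .txt file; return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for url in urls:
        text = extractor(url)
        if not text:
            continue  # skip pages that could not be fetched or parsed
        path = out / url_to_filename(url)
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```

Calling `save_sites(["https://example.com/"])` would attempt to write `output/example_com.txt`. The `extractor` parameter is there only so the download step can be swapped out, for instance with a stub during testing. Note that two URLs differing only in their query string map to the same file name under this scheme, so a real tool would likely need a more careful naming rule.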
