Stars
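🚀🚀🚀 feapder is an easy-to-use, powerful Python crawler framework. It ships with four built-in spiders (AirSpider, Spider, TaskSpider, and BatchSpider) for different scenarios, and supports resumable crawling, monitoring and alerting, browser rendering, and large-scale data deduplication. The powerful crawler management system feaplat also provides it with…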
AgentQL is a suite of tools for connecting your AI to the web. Featuring a query language and Playwright integrations for interacting with elements and extracting data quickly, precisely, and at sc…
Official implementation of the paper "AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation" [EMNLP '24]
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
Formula recognition based on LaTeX-OCR and ONNXRuntime.
[ACM'MM 2024 Oral] Official code for "OneChart: Purify the Chart Structural Extraction via One Auxiliary Token"
Convert PDF to markdown + JSON quickly with high accuracy
🌳CED: Catalog Extraction from Documents
Prompt-learning methods (PET, EFL and NSP-BERT) implemented with BERT4Keras, for both Chinese and English.
A universal scalable machine learning model deployment solution
Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS ev…
Awesome multilingual OCR toolkits based on PaddlePaddle (a practical, ultra-lightweight OCR system that supports recognition of 80+ languages, provides data annotation and synthesis tools, and supports training and…
2nd-place solution for the ICDAR 2021 Competition on Scientific Literature Parsing, Task B.
An implementation of the Splitting and Merging table recognition method.
CDLA: A Chinese document layout analysis (CDLA) dataset
An implementation of the BERT model and its related downstream tasks based on the PyTorch framework. @月来客栈
Chinese text classification using BERT and ERNIE
PyTorch implementation of BERT in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Pdf2Dom is a PDF parser that converts documents to an HTML DOM representation. The resulting DOM tree may then be serialized to an HTML file or processed further. A command-line utility for conve…
Non-local Neural Networks for Video Classification