Common Crawl - Open Repository of Web Crawl Data

Latest Blog Post:

Host- and Domain-Level Web Graphs October, November, and December 2024

We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of October, November, and December 2024. The crawls used to generate the graphs were CC-MAIN-2024-42, CC-MAIN-2024-46, and CC-MAIN-2024-51.

Thom Vaughan

Thom is Principal Technologist at the Common Crawl Foundation.
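
The crawl identifiers in the announcement above (CC-MAIN-2024-42, CC-MAIN-2024-46, CC-MAIN-2024-51) can be lined up with the months they cover. The sketch below assumes the suffix of each identifier is a four-digit year followed by a two-digit ISO week number; that reading is an assumption made here for illustration, not something this page states. The helper name `crawl_week_start` is hypothetical.

```python
import re
from datetime import date

# Assumed identifier layout: CC-MAIN-<year>-<ISO week>, e.g. CC-MAIN-2024-42.
CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def crawl_week_start(crawl_id: str) -> date:
    """Return the Monday of the ISO week encoded in a crawl identifier.

    Hypothetical helper: it only interprets the identifier string and
    does not contact any Common Crawl service.
    """
    match = CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl identifier: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # date.fromisocalendar(year, week, weekday) requires Python 3.8+;
    # weekday 1 is Monday.
    return date.fromisocalendar(year, week, 1)


if __name__ == "__main__":
    for crawl_id in ("CC-MAIN-2024-42", "CC-MAIN-2024-46", "CC-MAIN-2024-51"):
        print(crawl_id, crawl_week_start(crawl_id).isoformat())
```

Under that assumption, the three identifiers fall in October, November, and December 2024 respectively, which matches the months named in the announcement.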

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.

Featured Papers:

Research on Free Expression Online

Jeffrey Knockel, Jakub Dalek, Noura Aljizawi, Mohamed Ahmed, Levi Meletti, and Justin Lau

Banned Books: Analysis of Censorship on Amazon.com

Improved Trade-Offs Between Data Quality and Quantity for Long-Horizon Model Training

Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

Analyzing the Australian Web with Web Graphs: Harmonic Centrality at the Domain Level

Xian Gong, Paul X. McCarthy, Marian-Andrei Rizoiu, Paolo Boldi

Harmony in the Australian Domain Space

The Dangers of Hijacked Hyperlinks

Kevin Saric, Felix Savins, Gowri Sankar Ramachandran, Raja Jurdak, Surya Nepal

Hyperlink Hijacking: Exploiting Erroneous URL Links to Phantom Domains

Enhancing Computational Analysis

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Computation and Language

Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Zoraida Callejas

esCorpius: A Massive Spanish Crawling Corpus

The Web as a Graph (Master's Thesis)

Marius Løvold Jørgensen, UiT Norges Arktiske Universitet

BacklinkDB: A Purpose-Built Backlink Database Management System

Internet Security: Phishing Websites

Asadullah Safi, Satwinder Singh

A Systematic Literature Review on Phishing Website Detection Techniques
