Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Common Crawl is a 501(c)(3) non-profit founded in 2007. We make wholesale extraction, transformation and analysis of open web data accessible to researchers.
Over 250 billion pages spanning 15 years. Free and open corpus since 2007. Cited in over 10,000 research papers. 3–5 billion new pages added each month.
We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of October, November, and December 2024. The crawls used to generate the graphs were CC-MAIN-2024-42, CC-MAIN-2024-46, and CC-MAIN-2024-51.
Thom Vaughan
Thom is Principal Technologist at the Common Crawl Foundation.