A comprehensive tool for tracking and visualizing website content changes by monitoring sitemaps over time. This system crawls website sitemaps, detects URL additions and removals, and generates interactive reports to visualize these changes.
- Sitemap Crawling: Automatically discovers and downloads all sitemaps for a domain
- Change Detection: Identifies new and deleted URLs between crawls
- CSV Exports: Saves all discovered URLs and their changes to CSV files
- HTML Reports: Generates interactive HTML reports with:
  - Summary statistics of added/removed URLs
  - Line charts showing URL changes over time
  - Detailed per-crawl reports of specific URL changes
- Flexible Usage: Run as a standalone tool or integrate into larger workflows
```
pip install requests tldextract pandas ultimate-sitemap-parser tqdm jinja2
```
- Clone this repository
- Ensure the scripts are executable:

  ```
  chmod +x differ.py reporter.py
  ```
```
# Crawl a site's sitemaps and record changes (default, verbose, or quiet output)
./differ.py https://example.com
./differ.py https://example.com --verbose
./differ.py https://example.com --quiet
```
```
# Generate HTML reports from completed crawls
./reporter.py https://example.com
./reporter.py https://example.com --output-dir /path/to/reports
```
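For scheduled monitoring, both scripts can be driven from cron. An illustrative crontab entry (the install path is a placeholder):

```shell
# Crawl example.com's sitemaps daily at 02:00, then rebuild the reports
0 2 * * * cd /path/to/sitemap-tracker && ./differ.py https://example.com --quiet && ./reporter.py https://example.com
```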
- Validates the input URL
- Creates output directories based on domain name and timestamp
- Discovers all sitemaps using the `usp` (Ultimate Sitemap Parser) library
- Downloads each sitemap locally
- Extracts all page URLs and their source sitemaps
- Compares with previous runs to identify new and deleted URLs
- Generates a `diff.csv` file with all changes
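The comparison step above boils down to a set difference between the current and previous URL lists. A minimal sketch (the function names and the two-column `diff.csv` layout are illustrative, not the tool's actual internals):

```python
import csv

def compute_diff(previous_urls, current_urls):
    """Return (added, deleted) URL sets between two crawls."""
    previous, current = set(previous_urls), set(current_urls)
    # Added: present now, absent before. Deleted: present before, absent now.
    return current - previous, previous - current

def write_diff_csv(path, added, deleted):
    """Write changes in a hypothetical two-column layout: change, url."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["change", "url"])
        for url in sorted(added):
            writer.writerow(["added", url])
        for url in sorted(deleted):
            writer.writerow(["deleted", url])

# Example: one URL added, one removed between runs
added, deleted = compute_diff(
    ["https://example.com/", "https://example.com/about"],
    ["https://example.com/", "https://example.com/blog"],
)
```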
- Finds all `diff.csv` files from previous crawler runs
- Aggregates data from all diffs
- Creates a main index report with summary statistics and trends chart
- Generates individual run reports for each crawler run that had changes
- Sets up all necessary HTML templates, CSS, and JavaScript
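The aggregation step can be as simple as tallying per-run counts out of each `diff.csv`. A standard-library sketch, assuming the same hypothetical `change,url` column layout as above (not taken from the tool's source):

```python
import csv
import io

def summarize_diff(csv_text):
    """Tally added/deleted URLs in one diff.csv (assumed columns: change, url)."""
    added = deleted = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["change"] == "added":
            added += 1
        elif row["change"] == "deleted":
            deleted += 1
    return added, deleted

sample = "change,url\nadded,https://example.com/blog\ndeleted,https://example.com/old\n"
```

Per-run totals like these are what feed the summary table and the trends chart in the index report.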
```
├── differ.py            # Sitemap crawler and diff generator
├── reporter.py          # HTML report generator
├── templates/           # HTML templates (created automatically)
│   ├── index.html       # Main report template
│   └── run_report.html  # Individual run report template
└── static/              # Static assets (created automatically)
    ├── css/
    │   └── style.css    # CSS styles for reports
    └── js/
        └── charts.js    # JavaScript for charts
```
The tool organizes data by domain and timestamp:
```
example.com/
├── 1650640583/              # Timestamp of first run
│   ├── [sitemap files]      # Downloaded sitemap files
│   └── urls.csv             # All discovered URLs
├── 1650726983/              # Timestamp of second run
│   ├── [sitemap files]
│   ├── urls.csv
│   └── diff.csv             # Changes since previous run
└── reports/                 # Generated HTML reports
    ├── index.html
    ├── report_1650726983.html
    └── static/
```
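Because run directories are named by Unix timestamp, "the previous run" is simply the next-smaller numeric directory. A sketch of that lookup (an illustrative helper, with directory discovery reduced to a plain list of names):

```python
def previous_run(entries, current):
    """Return the timestamp directory immediately before `current`, or None.

    Non-numeric entries (such as the reports/ directory) are ignored.
    """
    earlier = sorted(
        int(e) for e in entries if e.isdigit() and int(e) < int(current)
    )
    return str(earlier[-1]) if earlier else None
```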
The generated reports include:
- Summary statistics (total runs, total URLs added/deleted)
- Interactive line chart showing URL changes over time
- Table of all runs with links to detailed reports
- Detailed per-run reports showing specific URLs added or removed
- Content Auditing: Monitor website growth or content pruning
- SEO Monitoring: Track indexable content changes
- Competitive Analysis: Monitor competitor website changes
- Content Migration Validation: Verify URLs are properly maintained during site migrations
- Automated Testing: Integrate into CI/CD pipelines to verify content deployment
The modular design makes it easy to extend:
- Modify `differ.py` to capture additional metadata from sitemaps
- Update the HTML templates to add new visualizations
- Integrate with notification systems to alert on significant changes
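As one example of the notification idea, a small threshold check could run after each crawl; the threshold and message format below are illustrative, and wiring the message to email or chat is left out:

```python
def change_alert(added, deleted, threshold=50):
    """Return an alert message if total changes exceed the threshold, else None."""
    total = len(added) + len(deleted)
    if total <= threshold:
        return None
    return (
        f"Sitemap alert: {len(added)} URLs added, "
        f"{len(deleted)} removed ({total} total changes)"
    )
```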