[TOC]
Tip
This Markdown
document may contains some Mermaid diagrams so please consider install Typora to read/manage Markdown
files and don't miss any advanced feature.
This repository contains a web scraper that allows you to build the Spanish postal codes database from Sociedad Estatal de Correos y Telégrafos.
The application is built on top of PHP + Guzzle + concurrent requests to improve the performance.
Tool | Required/Recommended | Description |
---|---|---|
Git | Required | To interact with the VCS repository |
Docker | Required | To manage the development environment |
Make | Recommended | To interact with the development environment |
╔══════════════════════════════════════════════════════════════════════════════╗
║ ║
║ .: AVAILABLE COMMANDS :. ║
║ ║
╚══════════════════════════════════════════════════════════════════════════════╝
· show-context Setup: show context
· build Docker: builds the service
· up Docker: starts the service
· restart Docker: restarts the service
· down Docker: stops the service
· logs Docker: exposes the service logs
· bash Docker: establish a bash session into main container
· init Application: initializes the scrape process
Postal codes in Spain were created on July 1, 1984, when the Sociedad Estatal de Correos y Telégrafos (https://www.correos.es/) introduced automated mail sorting.
A postal code consists of a five-digit number between 01000..52999, where the first two digits (01..52) correspond to one of the 50 provinces of Spain or to one of the two Spanish autonomous cities on the African coast. The last three digits correspond to the postal codes available in each province.
You can find the list of provinces and their corresponding codes at the Instituto Nacional de Estadística.
For example:
Province | Province ID | Range | Possible Postcodes |
---|---|---|---|
Barcelona | 08 | 0..999 | 08000..08999 |
Málaga | 29 | 0..999 | 29000..29999 |
Ceuta | 51 | 0..999 | 51000..51999 |
Postal codes in Spain were created on July 1, 1984 when the Sociedad Estatal de Correos y Telégrafos, also known as Correos, introduced automated mail sorting. This company provides a web form from which postal codes can be found. If we open the Developer Tools / Network tab we will see that a background request is made to retrieve postal code suggestions.
Important
This application uses that endpoint to check if a postal code exists.
A postal code consists of a five-digit number between 01000..52999, where the first two digits (01..52) correspond to one of the 50 provinces of Spain or to one of the two Spanish autonomous cities on the African coast. The last three digits correspond to the postal codes available in each province.
Province | Province ID | Range | Possible Postcodes |
---|---|---|---|
Barcelona | 08 | 0..999 | 08000..08999 |
Málaga | 29 | 0..999 | 29000..29999 |
Ceuta | 51 | 0..999 | 51000..51999 |
Tip
You can find the list of Spanish provinces and their corresponding codes at the Instituto Nacional de Estadística.
GET https://api1.correos.es/digital-services/searchengines/api/v1/suggestions?text=08001
HTTP 200 OK
{
"suggestions": [
{
"text": "08001, Barcelona, Barcelona, Cataluña, ESP",
"longitude": 2.1686990270000592,
"latitude": 41.380160001000036
}
]
}
https://api1.correos.es/digital-services/searchengines/api/v1/suggestions?text=52999
HTTP 200 OK
{
"code": "404",
"message": "Not Found",
"moreInformation": {
"description": "Not results found.",
"link": "www.correos.es"
}
}
For each province this application generates all possible postal code combinations and checks their existence using the endpoint described above.
Important
If the answer is valid then it stores the postal code details into a CSV file for easy processing.
Important
The scrape process is performed by a PHPUnit unit tests because using tests allows long executions without any timeout and additionally it allows to validate the imported data in each iteration.
Just clone the repository into your preferred path:
$ mkdir -p ~/path/to/my-new-project && cd ~/path/to/my-new-project
$ git clone git@github.com:alcidesrc/scraping-correos-with-php.git .
$ make init
ℹ Stopping the service
[+] Running 2/2
✔ Container correos-app-run-7e80add999ab Removed 10.2s
✔ Network correos_default Removed 0.2s
✓ Task done!
ℹ Building the image
[+] Building 1.1s (24/24) FINISHED docker:default
=> [app internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 4.76kB 0.0s
=> [app internal] load .dockerignore ...
=> => naming to docker.io/library/correos:dev 0.0s
✓ Task done!
ℹ Installing PHP dependecies...
docker compose run --rm --user 1000:1000 app composer install
[+] Creating 1/1
✔ Network correos_default Created 0.1s
Installing dependencies from lock file (including require-dev)
Verifying lock file contents can be installed on current platform.
Nothing to install, update or remove
Generating optimized autoload files
32 packages you are using are looking for funding.
Use the `composer fund` command to find out more!
✓ Task done!
ℹ Generating CSV files...
PHPUnit 11.3.1 by Sebastian Bergmann and contributors.
Runtime: PHP 8.3.10 with PCOV 1.0.11
Configuration: /code/phpunit.xml
Random Seed: 2453001523
....................................................... 55 / 55 (100%)
Time: 18:15.756, Memory: 14.00 MB
Correos (Tests\Unit\Importers\Correos)
✔ Check exception is raised with wrong province
✔ Scrape province with ALBACETE
✔ Scrape province with ÁVILA
✔ Scrape province with CÓRDOBA
✔ Scrape province with ZARAGOZA
✔ Scrape province with MELILLA
✔ Scrape province with HUESCA
✔ Scrape province with RIOJA,·LA
✔ Scrape province with BIZKAIA
✔ Scrape province with CORUÑA,·A
✔ Scrape province with BURGOS
✔ Scrape province with TOLEDO
✔ Scrape province with CASTELLÓN/CASTELLÓ
✔ Scrape province with MADRID
✔ Scrape province with LLEIDA
✔ Scrape province with BADAJOZ
✔ Scrape province with ARABA/ÁLAVA
✔ Scrape province with GUADALAJARA
✔ Scrape province with CEUTA
✔ Scrape province with BARCELONA
✔ Scrape province with BALEARS,·ILLES
✔ Scrape province with VALLADOLID
✔ Scrape province with PALMAS,·LAS
✔ Scrape province with TERUEL
✔ Scrape province with ALICANTE/ALACANT
✔ Scrape province with SEVILLA
✔ Scrape province with CUENCA
✔ Scrape province with SALAMANCA
✔ Scrape province with GRANADA
✔ Scrape province with NAVARRA
✔ Scrape province with GIRONA
✔ Scrape province with SEGOVIA
✔ Scrape province with ALMERÍA
✔ Scrape province with SORIA
✔ Scrape province with MÁLAGA
✔ Scrape province with ZAMORA
✔ Scrape province with SANTA·CRUZ·DE·TENERIFE
✔ Scrape province with GIPUZKOA
✔ Scrape province with TARRAGONA
✔ Scrape province with LEÓN
✔ Scrape province with CÁDIZ
✔ Scrape province with HUELVA
✔ Scrape province with JAÉN
✔ Scrape province with CANTABRIA
✔ Scrape province with ASTURIAS
✔ Scrape province with CÁCERES
✔ Scrape province with CIUDAD·REAL
✔ Scrape province with PONTEVEDRA
✔ Scrape province with LUGO
✔ Scrape province with VALENCIA/VALÈNCIA
✔ Scrape province with PALENCIA
✔ Scrape province with MURCIA
✔ Scrape province with OURENSE
✔ Validate first postal code from specific provinces with ARABA/ÁLAVA
✔ Validate first postal code from specific provinces with ALBACETE
OK (55 tests, 110 assertions)
Generating code coverage report in HTML format ... done [00:00.015]
Code Coverage Report:
2024-08-20 12:36:15
Summary:
Classes: 50.00% (1/2)
Methods: 83.33% (5/6)
Lines: 97.18% (69/71)
App\CsvHandler
Methods: 50.00% ( 1/ 2) Lines: 88.89% ( 16/ 18)
App\Importers\Correos
Methods: 100.00% ( 4/ 4) Lines: 100.00% ( 53/ 53)
✓ Task done!
Tip
CSV files are stored at ./src/output/province-XX.csv
for easy processing
Important
Here you can find a Gist with all Spanish postal codes combined into one single file.
Please review our security policy on how to report security vulnerabilities:
PLEASE DON'T DISCLOSE SECURITY-RELATED ISSUES PUBLICLY
Only the latest major version receives security fixes.
If you discover a security vulnerability within this project, please open an issue here. All security vulnerabilities will be promptly addressed.
The MIT License (MIT). Please see LICENSE file for more information.