c12i/kenya-power-pdf-extract: A prototype implementation of parsing Kenya Power interruption PDF documents to JSON format
Kenya Power Interruptions PDF Extract

Parsing Kenya Power interruption data from their PDF files into JSON format

🧐 lemme tinker with the pdf file, see if I can parse the data

— collins muriuki (@collinsmuriuki_) July 12, 2022

Steps

The first step is to extract the text content of the PDF file as a string. Luckily, the Rust crate pdf-extract handles this for us via its extract_text function. Note that holding the whole text in a single String is not the most memory-efficient approach, since memory usage grows with the size of the PDF text; that is an acceptable compromise for this short demo.

The next bit is where the "fun" begins: making something meaningful out of the messy text that we get back. First, we filter out junk, i.e. text that doesn't hold any meaningful data. This is handled by the extract_text_from_pdf function.
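As a rough illustration of this filtering step, here is a minimal sketch that uses only the standard library. The function name `strip_junk_lines` and the junk markers are hypothetical; the actual rules inside extract_text_from_pdf are not shown in this README and may well differ.

```rust
/// Hypothetical junk markers: substrings assumed to appear only in
/// boilerplate lines (headers, footers), never in outage data.
const JUNK_MARKERS: [&str; 2] = ["Interruption of", "www.kplc.co.ke"];

/// Drops blank lines and lines containing a junk marker, then joins the
/// remaining lines with single spaces so later steps see one long string.
fn strip_junk_lines(raw: &str) -> String {
    raw.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .filter(|line| !JUNK_MARKERS.iter().any(|marker| line.contains(*marker)))
        .collect::<Vec<_>>()
        .join(" ")
}
```

Joining the surviving lines into one string keeps the later split-based steps simple, at the cost of losing the original line structure.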

The next step is to break the massive string into smaller chunks, each containing the outage information for a single area. The approach is simple: we split the string at "AREA:". See the FromStr implementation of OutagesList.
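The splitting idea can be sketched as follows. `AreaChunks` is a hypothetical stand-in for OutagesList that keeps the raw string chunks rather than parsed items, so this shows the technique only, not the repository's actual implementation.

```rust
use std::convert::Infallible;
use std::str::FromStr;

/// Hypothetical stand-in for `OutagesList` that stores the raw chunks
/// instead of parsed items.
#[derive(Debug, PartialEq)]
struct AreaChunks(Vec<String>);

impl FromStr for AreaChunks {
    type Err = Infallible; // splitting alone cannot fail

    fn from_str(text: &str) -> Result<Self, Self::Err> {
        let chunks = text
            .split("AREA:")
            .skip(1) // whatever precedes the first "AREA:" is not an outage
            .map(|chunk| chunk.trim().to_string())
            .filter(|chunk| !chunk.is_empty())
            .collect();
        Ok(AreaChunks(chunks))
    }
}
```

Implementing FromStr is what lets a caller write `text.parse::<AreaChunks>()`, mirroring the `pdf_text.parse::<OutagesList>()` call in the main function.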

Now that we have a list of strings, we can work out how to handle a single string from the list. The main goal is to establish breakpoints in the string. This is done with two regex objects, stored as lazy static variables:

  • DATE_RE - matches the date of the outage. From this match we can derive the date as well as the text that comes before it; at this point we have the region and the date.
  • TIME_RE - matches the time range during which the outage will occur, as well as the affected areas, i.e. the text that follows the date; at this point we also have the time and the areas.
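The sketch below illustrates the same breakpoint idea using plain standard-library string searches in place of the two regexes, so that it runs without external crates. It assumes each chunk has the layout `<region> DATE: <date> TIME: <time range> <comma-separated areas>`, which is a guess at the extracted PDF text rather than something this README confirms. `ChunkParts` and `split_chunk` are hypothetical names.

```rust
/// Hypothetical, simplified counterpart of `OutagesItem`.
#[derive(Debug, PartialEq)]
struct ChunkParts {
    region: String,
    date: String,
    time: String,
    areas: Vec<String>,
}

/// Splits one chunk at two breakpoints. Assumed layout (not confirmed by
/// the README): "<region> DATE: <date> TIME: <start> – <end> <areas>",
/// where both clock times end in "M." (as in "A.M." / "P.M.").
fn split_chunk(chunk: &str) -> Option<ChunkParts> {
    // Breakpoint 1: everything before the date is the region.
    let (region, rest) = chunk.split_once("DATE:")?;
    // Breakpoint 2: the date ends where the time range starts.
    let (date, rest) = rest.split_once("TIME:")?;
    // The time range ends right after its second "M.".
    let (idx, _) = rest.match_indices("M.").nth(1)?;
    let (time, areas) = rest.split_at(idx + "M.".len());
    Some(ChunkParts {
        region: region.trim().to_string(),
        date: date.trim().to_string(),
        time: time.trim().to_string(),
        areas: areas
            .split(',')
            .map(|area| area.trim().to_string())
            .filter(|area| !area.is_empty())
            .collect(),
    })
}
```

Returning `Option` lets a caller skip chunks that lack either breakpoint instead of panicking. A regex-based version would tolerate more variation in the date and time formats than these fixed labels do.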

What is left is to put everything together by creating two structs, OutagesList and OutagesItem, with their respective FromStr trait implementations, so that our main function finally looks like this:

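use anyhow::Context;
use kenya_power_pdf_extract::{extract_text_from_pdf, OutagesList};

fn main() -> Result<(), anyhow::Error> {
    // The first CLI argument is the path to the PDF file; report an error
    // instead of panicking when it is missing.
    let path = std::env::args()
        .nth(1)
        .context("expected the path to a PDF file as the first argument")?;
    // Extract the cleaned-up text, then parse it into a list of outages.
    let pdf_text = extract_text_from_pdf(&path)?;
    let outages_list = pdf_text.parse::<OutagesList>()?;
    println!("{:#?}", outages_list);
    Ok(())
}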

Output snippet:

OutagesList {
    data: [
        OutagesItem {
            region: "PART OF KILIMANI, MILIMANI",
            date: "Monday 18.07.2022",
            time: "9.00 A.M. – 5.00 P.M.",
            areas: [
                "Part  of  Jabavu  Rd",
                "Woodlands",
                "DoD  Headquarters",
                "Woodlands  Mosque",
                "Part  ofHurlingum S/Centre",
                "Jabavu Court",
                "Chinese Embassy",
                "Russian Embassy",
                "Sri LankaEmbassy",
                "Jakaya  Kikwete  Rd",
                "Delamere  Flats",
                "Sagret  Hotel",
                "Comfort  Hotel",
                "SwizzHotel",
                "Ralph Bunch Rd",
                "Integrity Centre",
                "Middle East Bank",
                "Heron Portico",
                "PITMAN,Telkom  Plaza",
                "Adak  House  Nairobi  Central  SDA",
                "Nairobi  Area  Police",
                "Medical  &Dentist Board",
                "Lenana Rd & adjacent customers.",
            ],
        },
//...

Local Development

Requires Rust and Cargo to be installed.

Once that's done, run:

cargo run ./files/kenya_power.pdf

Check the output folder for the captured stdout of runs against both kenya_power_latest.pdf and kenya_power.pdf, which live in the files directory.

Caveats

  • Only tested with 4 PDF files obtained from kplc.co.ke, so some edge cases might not be covered
  • Data is only grouped by AREA rather than REGION; this can be fixed, but I decided to keep things simple for now

Authored by Collins Muriuki

This project is MIT licensed
