This is a python program that uses scrapy (I am running version 0.12 under Ubuntu 11.04) to scrape the mixmarket web site. It is pretty rudimentary, but functional.
The mixmarket.csv is an example output of Afghanistan and Brazil (hey, they were the first two countries).
To run:
scrapy crawl -L WARNING mixmarket
Notes:
- Scrapy documentation:
<http://doc.scrapy.org/en/latest/>
<http://doc.scrapy.org/en/latest/intro/tutorial.html> - This doesn’t fully crawl the web site, it starts with the country page(s). The example configuration crawls Afghanistan and Brazil. See the variable “start_urls” in
mixmarket/spiders/mixmarket_spider.py
- The fields are extracted in the spider code
mixmarket/spiders/mixmarket_spider.py
- This is not extracting all of the available fields. Fields can be added by looking at example page source (^U in your web browser), finding the fields of interest, and then figuring out how to extract the desired information from the fields using XPath matching. The “scrapy shell ” technique from the tutorial is very helpful for hacking together a XPath match quickly.
- Example for extending
scrapy shell http://mixmarket.org/mfi/afs
Then at the interactive prompt:
hxs.select(‘//div[@class=“field field-type-text field-field-mfi-is-regulated”]/div/div/text()’).extract()
Out105: [u’\n no ’]
- Example for extending
- This was written on linux. It should work OK on Windows too by installing python and the libraries, but I’m Windows Free tm. On Ubuntu, it is “apt-get install python-scrapy” and boom it is functional. Gotta love it.
- If you add CSV fields, remember to add them to “mixmarket/settings.py”. If you want the fields to come out in a different order, reorder them there.
- Improving the source should be easy: it is just XPath matching techniques plus simple python.
- The scraping should be pretty robust. If the mixmarket guys redo their site, you will have to find and fix the XPath matches for the items, but that is not going to change much/often.
Copyright © 2011,2012 Gerald Van Baren, gvb @ unssw.com
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
“Software”), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.