Tolerant JSONDecoder

Description

This extends the Python default json parser to parse JSON documents with single-quote delimited strings, which is outside the standard of JSON.

Motivation

When scraping webpages, occasionally it can be useful to extract some of the JavaScript datastructures.

It is problematic to parse this content, without using a Javascript parser such as pyjsparser https://github.com/PiotrDabkowski/pyjsparser or esprima-python https://github.com/Kronuz/esprima-python

This is still suboptimal, as taking that approach still leaves you with a syntax tree that you need to convert into Python objects, assuming this is your end goal.

The Python JSON library is great - one can also do easy things like find the start of a data-structure, then use jason.JSONDecoder().raw_decode() and it will work out where the data-structure ends, which saves on trying to identify the end of any data-structure yourself.

Unfortunately, when trying to decode these data-structures, one discovers that they can be valid JavaScript, but they are not strict JSON, which allows only double-quoted strings to appear in its data-structures.

Some of the advice around suggests using ast.literal_eval(), however this fails when boolean constants are present, as true and false not parseable tokens in Python, e.g. ["asdf", 'asdf", false] will not parse using this approach.

Therefore, to fix the problem, I've written a custom JSONDecoder which is tolerant of single quoted strings. This is a small monkey-patch of the default JSONDecoder.

Usage

The decoder may be used as follows:

import json
import tolerantjsondecoder

jsonText = """{'a':"b", "c":'d'}"""
data = json.loads(jsonText, cls = tolerantjsondecoder.JSONDecoder)
print(data)

or, alternatively:

import tolerantjsondecoder

jsonDecoder = tolerantjsondecoder.JSONDecoder()

htmlDoc = """
<html>
<body>
<script>
var cheeseCollection=['cheddar', 'camembert', 'roquefort']
</script>
</body>
</hmtl>
"""

startTag = 'var cheeseCollection='
idx = htmlDoc.find(startTag) + len(startTag)
cheeses = jsonDecoder.raw_decode(htmlDoc[idx:])
print(cheeses)[0]

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
test		test
tolerantjsondecoder		tolerantjsondecoder
README.rst		README.rst
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Tolerant JSONDecoder

Description

Motivation

Usage

About

Uh oh!

Releases

Packages

Uh oh!

Languages

jpz/tolerantjsondecoder

Folders and files

Latest commit

History

Repository files navigation

Tolerant JSONDecoder

Description

Motivation

Usage

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages