Python wrapper for the arXiv API.
arXiv is a project by the Cornell University Library that provides open access to 1,000,000+ articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics.
$ pip install arxiv
Verify the installation with
$ python setup.py test
In your Python script, include the line
import arxiv
arxiv.query(query="",
id_list=[],
max_results=None,
start = 0,
sort_by="relevance",
sort_order="descending",
prune=True,
iterative=False,
max_chunk_results=1000)
Argument | Type | Default |
---|---|---|
query |
string | "" |
id_list |
list of strings | [] |
max_results |
int | 10 |
start |
int | 0 |
sort_by |
string | "relevance" |
sort_order |
string | "descending" |
prune |
boolean | True |
iterative |
boolean | False |
max_chunk_results |
int | 1000 |
-
query
: an arXiv query string. Format documented here.- Note: multi-field queries must be space-delimited.
au:balents_leon AND cat:cond-mat.str-el
is valid;au:balents_leon+AND+cat:cond-mat.str-el
is not valid.
- Note: multi-field queries must be space-delimited.
-
id_list
: list of arXiv record IDs (typically of the format"0710.5765v1"
). -
max_results
: the maximum number of results returned by the query. -
start
: the offset of the first returned object from the arXiv query results. -
sort_by
: the arXiv field by which the result should be sorted. -
sort_order
: the sorting order, i.e. "ascending", "descending" or None. -
prune
: whenTrue
, received abstract objects will be simplified. -
iterative
: whenTrue
,query()
will return an iterator. Otherwise,query()
iterates internally and returns the full list of results. -
max_chunk_results
: the maximum number of abstracts ot be retrieved by a single internal request to the arXiv API.
Query examples:
import arxiv
# Keyword queries
arxiv.query(query="quantum", max_results=100)
# Multi-field queries
arxiv.query(query="au:balents_leon AND cat:cond-mat.str-el")
# Get single record by ID
arxiv.query(id_list=["1707.08567"])
# Get multiple records by ID
arxiv.query(id_list=["1707.08567", "1707.08567"])
# Get interator over query results
result = arxiv.query(query="quantum", max_chunk_results=10, iterative=True)
for paper in result():
print(paper)
For a more detailed description of the interaction between query
and id_list
, see this section of the arXiv documentation.
arxiv.arxiv.download(obj, dirpath='./', slugify=slugify, prefer_source_tarfile=False<
684C
/span>)
Argument | Type | Default | Required? |
---|---|---|---|
obj |
dict | N/A | Yes |
dirpath |
string | "./" |
No |
slugify |
function | arxiv.slugify |
No |
prefer_source_tarfile |
bool | False |
No |
-
obj
is a result object, one of a list returned by query().obj
must at minimum contain values corresponding topdf_url
andtitle
. -
dirpath
is the relative directory path to which the downloaded PDF will be saved. It defaults to the present working directory. -
slugify
is a function that processesobj
into a filename. By default,arxiv.download(obj)
prepends the object ID to the object title. -
If
prefer_source_tarfile
isTrue
, this function will download the source files forobj
––rather than the rendered PDF––in .tar.gz format.
import arxiv
# Query for a paper of interest, then download
paper = arxiv.query(id_list=["1707.08567"])[0]
arxiv.download(paper)
# You can skip the query step if you have the paper info!
paper2 = {"pdf_url": "http://arxiv.org/pdf/1707.08567v1",
"title": "The Paper Title"}
arxiv.download(paper2)
# Download the gzipped tar file
arxiv.download(paper,prefer_source_tarfile=True)
# Returns the object id
def custom_slugify(obj):
return obj.get('id').split('/')[-1]
# Download with a specified slugifier function
arxiv.download(paper, slugify=custom_slugify)