FTP Service
NLM provides cloud service access to the PMC Open Access Subset and the PMC Author Manuscript Dataset for faster retrieval. As part of this service, content from these datasets is accessible to users on Amazon Web Services (AWS), without charge, through either an HTTPS or S3 URL, and without any log-in requirement for retrieval. Cloud Service documentation is available on the PMC Cloud Service and Accessing PMC Article Datasets Using AWS pages.
The PMC File Transfer Protocol (FTP) Service supports usage of the PMC Article Datasets with the following services:
Bulk Download
- Available for: PMC Open Access Subset, Author Manuscript Dataset, and Historical OCR Dataset
- Packages include: XML or plain text files packaged in compressed baseline and daily incremental packages, with each baseline containing hundreds of thousands of articles (Note: The Historical OCR Dataset is only available in plain text format.)
Individual Article Download
- Available for: PMC Open Access Subset only
- Packages include: XML, PDF (if present), media files, and supplementary materials for a single article
PDF Download
- Available for: PMC Open Access Subset only
- Individual PDFs of articles: only available for non-commercial use licensed articles
PMC ID Cross-referencing
- Cross reference any PMC article ID with identifiers such as PubMed IDs, DOIs, and Author Manuscript IDs
- File: PMC-ids.csv.gz, a file in the top-level FTP directory
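As a rough illustration, a locally downloaded copy of the cross-reference file could be loaded into a lookup table with Python's standard library. The column names used below (PMCID, PMID, DOI) are assumptions about the file's header row and are not stated on this page, so check the first line of PMC-ids.csv.gz before relying on them.

```python
import csv
import gzip
import io


def build_id_lookup(csv_text_stream, key_column="PMCID"):
    """Map each value in key_column to its full row (column name -> value).

    key_column is an assumed header name, not one documented on this page.
    """
    lookup = {}
    for row in csv.DictReader(csv_text_stream):
        key = row.get(key_column, "")
        if key:  # skip rows that have no value in the key column
            lookup[key] = row
    return lookup


def load_id_lookup(path="PMC-ids.csv.gz", key_column="PMCID"):
    """Read a gzip-compressed copy of the file that is already on disk."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return build_id_lookup(handle, key_column)


# Tiny in-memory stand-in for the real file (header names are assumed).
sample = io.StringIO("PMCID,PMID,DOI\nPMC555938,15748298,\n")
table = build_id_lookup(sample)
print(table["PMC555938"]["PMID"])
```

Keying on a different column (for example the assumed PMID column) only requires passing another `key_column` value.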
Base FTP URL: https://ftp.ncbi.nlm.nih.gov/pub/pmc
*Tip* If you are having difficulties with FTP, please consider trying the HTTPS protocol instead, e.g., https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/oa_comm/xml/oa_comm_xml.incr.2021-09-17.filelist.csv. NCBI also supports secure FTP via SFTP.
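A minimal sketch of retrieving a file over HTTPS with Python's standard library follows. The helper names `https_url` and `download` are hypothetical, and the sketch assumes that paths are given relative to the base URL above. The download call is left commented out so that nothing is fetched until you choose to run it.

```python
import shutil
import urllib.request

BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pub/pmc"  # base URL given on this page


def https_url(relative_path):
    """Join a path relative to the PMC directory onto the HTTPS base URL."""
    return f"{BASE_URL}/{relative_path.lstrip('/')}"


def download(relative_path, destination):
    """Stream one file to disk over HTTPS (requires network access)."""
    with urllib.request.urlopen(https_url(relative_path)) as response:
        with open(destination, "wb") as out_file:
            shutil.copyfileobj(response, out_file)


# Example call, commented out to avoid an unintended network request:
# download("PMC-ids.csv.gz", "PMC-ids.csv.gz")
```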
If you have questions or comments about the PMC FTP Service, please write to the PMC help desk. Further information on retrieving full text and other common developer queries can be found on the Developer Resources page.
Bulk Download
If you are interested only in the metadata and text of an article or author manuscript, bulk download is likely the best fit. Bulk packages group together hundreds of thousands of articles in XML or plain text formats in compressed packages (Note: The Historical OCR Dataset is only available in plain text format). If you are also interested in media files, supplementary materials, or PDFs, please see the sections on Individual Article Download and PDF Download.
Baseline Packages Update Schedule
New baseline packages will be created at least two times per year. Previous baseline and incremental packages and the accompanying file lists will be deleted whenever a new baseline is created.
New baselines will be created:
- mid-June
- mid-December
- as needed*
*PMC is sometimes required to suppress an article from public view for legal reasons if the case involves a legal injunction or a breach of patient privacy. In such cases, a new set of baseline packages will be created for the impacted dataset. This is not a frequent occurrence.
Directories Organized by Dataset, License Terms, and File Content Type
Bulk downloads are available on the FTP Service by dataset:
- PMC Open Access Subset - Bulk
- Author Manuscript Dataset - Bulk
- Historical OCR Dataset - Bulk
We have further divided the PMC Open Access Subset bulk packages into three groups based on available license terms:
- Commercial Use Allowed - CC0, CC BY, CC BY-SA, and CC BY-ND licenses
- Non-Commercial Use Only - CC BY-NC, CC BY-NC-SA, CC BY-NC-ND
- Other - no machine-readable license, no license, or a custom license
- PMC OA Subset - Commercial Use
- PMC OA Subset - Non-Commercial Use Only
- PMC OA Subset - Other
To access the complete PMC OA Subset you will need to retrieve ALL of the OA Subset packages. These groups are complementary rather than duplicative.
Each of these datasets or groupings is divided into separate directories by file content type: XML (xml/) and plain text (txt/). The baseline packages for each of these OA Subset groups and for the Author Manuscript Dataset are divided by PMCID range (e.g., PMC004XXXXXX) in order to keep package sizes reasonable.
The result is the following directory structure:
|_ manuscript/
|___ txt/
|___ xml/
|_ oa_bulk/
|___ oa_comm/
|_____ txt/
|_____ xml/
|___ oa_noncomm/
|_____ txt/
|_____ xml/
|___ oa_other/
|_____ txt/
|_____ xml/
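The tree above can be expressed as a small lookup, sketched here in Python. The function name `bulk_directory` and the short group keys are hypothetical labels chosen for this example. The Historical OCR Dataset is left out because its directory is not shown in the tree.

```python
# Directory for each bulk grouping, relative to the base FTP URL,
# copied from the directory tree above.
BULK_DIRECTORIES = {
    "manuscript": "manuscript",
    "oa_comm": "oa_bulk/oa_comm",
    "oa_noncomm": "oa_bulk/oa_noncomm",
    "oa_other": "oa_bulk/oa_other",
}
FILE_FORMATS = ("xml", "txt")


def bulk_directory(group, file_format):
    """Return the relative directory holding packages for a grouping and format."""
    if group not in BULK_DIRECTORIES:
        raise ValueError(f"unknown grouping: {group!r}")
    if file_format not in FILE_FORMATS:
        raise ValueError(f"unknown file format: {file_format!r}")
    return f"{BULK_DIRECTORIES[group]}/{file_format}/"


print(bulk_directory("oa_comm", "xml"))
print(bulk_directory("manuscript", "txt"))
```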
File Lists
File lists in CSV and txt formats are available for each package.
Note: Author manuscripts have different metadata information available than PMC OA Subset articles, so do not assume the same structure for the file lists for these two different datasets.
Sample Bulk File Names
- Baseline file list: oa_comm_xml.PMC003XXXXXX.baseline.2021-09-16.filelist.csv
- Baseline: oa_comm_xml.PMC003XXXXXX.baseline.2021-09-16.tar.gz
- Incremental file list: oa_comm_xml.incr.2021-09-17.filelist.csv
- Incremental update: oa_comm_xml.incr.2021-09-17.tar.gz
In each of the sample file names above you can substitute various parts to get to the files you want, e.g.
- Replace oa_comm with oa_noncomm to get PMC OA Subset non-commercial use articles, or replace it with oa_other to get PMC OA Subset articles without explicitly tagged Creative Commons licenses. Replace it with author_manuscript to get author manuscripts.
- Replace _xml with _txt to get plain text files instead of XML files.
- Replace baseline with incr to switch from a baseline file to one of the daily incremental files. Be sure to update the date and remove the PMC00#XXXXXX from the file name.
- Replace PMC003XXXXXX with PMC008XXXXXX in baseline file names to get the articles in the specified grouping with PMCIDs in the range from PMC8000000 to PMC8999999. To get all articles, you must retrieve all the PMCID ranges.
- Replace the date (e.g., 2021-09-16) with the new baseline date if the baseline has been updated since this documentation was written; replace the date for incremental files with the date you want to retrieve.
- Replace .csv with .txt as the file extension for the file list to get a tab-separated plain text version of the file list.
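The substitution rules above can be sketched as a small Python helper that assembles file names from their parts. Both function names are hypothetical. The range label rule (the PMCID number divided by one million, zero-padded to three digits) is an assumption generalized from the single PMC008XXXXXX example and may not hold for every range.

```python
def pmcid_range_label(pmcid):
    """Return the assumed baseline range label for a PMCID.

    Assumption: the label is the number of millions, zero-padded to 3 digits.
    """
    digits = pmcid[3:] if pmcid.upper().startswith("PMC") else pmcid
    return f"PMC{int(digits) // 1_000_000:03d}XXXXXX"


def bulk_file_name(prefix, file_format, date, pmcid_range=None, file_list_ext=None):
    """Assemble a bulk package or file list name.

    prefix: oa_comm, oa_noncomm, oa_other, or author_manuscript
    file_format: xml or txt
    pmcid_range: a label such as PMC003XXXXXX for a baseline, None for incremental
    file_list_ext: None for the .tar.gz package, "csv" or "txt" for its file list
    """
    stem = f"{prefix}_{file_format}"
    if pmcid_range is not None:
        stem += f".{pmcid_range}.baseline.{date}"
    else:
        stem += f".incr.{date}"
    if file_list_ext is None:
        return f"{stem}.tar.gz"
    return f"{stem}.filelist.{file_list_ext}"


print(bulk_file_name("oa_comm", "xml", "2021-09-16", "PMC003XXXXXX"))
print(bulk_file_name("oa_comm", "xml", "2021-09-17", file_list_ext="csv"))
```

The two calls reproduce the sample baseline and incremental file list names shown earlier.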
Individual Article Download (PMC Open Access Subset Only)
PMC Open Access Subset Individual Article Packages
If you only want to download some of the PMC OA Subset based on search criteria or if you want to download complete packages for articles that include XML, PDF, media, and supplementary materials, you will need to use the individual article download packages. To keep directories from getting too large, the packages have been randomly distributed into a two-level-deep directory structure. You can use the file lists in CSV or txt format to search for the location of specific files or you can use the OA Web Service API. The file lists and OA Web Service API also provide basic article metadata.
- Filenames: PMCXXXXXXX.tar.gz where the X's represent a specific PMCID
- File lists: oa_file_list.csv or oa_file_list.txt (Located up one level in the top level PMC FTP directory)
The first line of each file list is the timestamp the file was written. Subsequent rows contain metadata for each article.
Each row is divided into 7 metadata fields, delimited by comma or tab characters, for example:
oa_package/66/8b/PMC555938.tar.gz BMC Bioinformatics. 2005 Mar 7; 6:44 PMC555938 2023-06-11 23:35:18 15748298 CC BY no
The fields in the files are:
- The fully qualified name of the .tar.gz file for an article
- The article citation, comprising the journal title abbreviation, publication date, volume, issue, and the page range or elocation ID
- PMC accession number (PMCID)
- Last updated timestamp
- PubMed ID (PMID)
- License type - The value for "license type" can be any of the standard Creative Commons license variants (e.g., "CC BY"; "CC BY-NC"; "CC BY-NC-ND") or "NO-CC CODE". "NO-CC CODE" appears when the license is missing, has custom terms (i.e., not a Creative Commons license), or is not machine decodable.
- Retracted - The value for "Retracted" can be either "yes" or "no" to indicate whether this article is known by NLM to be retracted.
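The field layout above could be read as follows. This is a sketch, assuming the tab-separated version of the file list and no special quoting; the class and function names are hypothetical, and the timestamp line in the sample is made up for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class ArticleRecord:
    package_path: str
    citation: str
    pmcid: str
    last_updated: str
    pmid: str
    license_type: str
    retracted: bool


def read_file_list(text_stream, delimiter="\t"):
    """Return (timestamp_line, records) from an open file list."""
    timestamp = text_stream.readline().strip()  # first line: when the file was written
    records = []
    for row in csv.reader(text_stream, delimiter=delimiter):
        if len(row) != 7:
            continue  # skip blank or unexpected rows
        records.append(
            ArticleRecord(*row[:6], retracted=row[6].strip().lower() == "yes")
        )
    return timestamp, records


sample = io.StringIO(
    "2023-06-12 00:00:00\n"  # made-up timestamp line
    "oa_package/66/8b/PMC555938.tar.gz\tBMC Bioinformatics. 2005 Mar 7; 6:44\t"
    "PMC555938\t2023-06-11 23:35:18\t15748298\tCC BY\tno\n"
)
timestamp, records = read_file_list(sample)
print(records[0].pmcid, records[0].license_type, records[0].retracted)
```

Passing `delimiter=","` would apply the same logic to the CSV version, under the same assumptions.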
PDF Download (PMC Open Access Subset Only)
PMC Open Access Subset PDF Files
Individual article PDF downloads are only available for non-commercial use licensed articles. To keep directories from getting too large, the article PDFs have been randomly distributed into a two-level-deep directory structure. You can use the oa_non_comm_use_pdf file lists in CSV or txt format to search for the location of specific files, or you can use the OA Web Service API. The file lists and OA Web Service API also provide basic article citation and license information, as well as the date the article was last updated in PMC.
- Filenames: filename.PMCXXXXXXX.pdf where filename is the original name of the source file and the X's represent a specific PMCID
- File lists: oa_non_comm_use_pdf.csv or oa_non_comm_use_pdf.txt (Located in the top level PMC FTP directory)
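Given the naming pattern above, the PMCID can be recovered from a PDF file name with a regular expression. The sketch below assumes the PMCID is always the last dot-separated component before the .pdf extension; the function name and the sample file names are hypothetical.

```python
import re

# Original source name, then the PMCID, then the .pdf extension.
PDF_NAME = re.compile(r"^(?P<source>.+)\.(?P<pmcid>PMC\d+)\.pdf$")


def split_pdf_name(file_name):
    """Return (original_source_name, pmcid) for a PDF file name."""
    match = PDF_NAME.match(file_name)
    if match is None:
        raise ValueError(f"not in filename.PMCXXXXXXX.pdf form: {file_name!r}")
    return match.group("source"), match.group("pmcid")


# Hypothetical file name, for illustration only.
print(split_pdf_name("article.PMC555938.pdf"))
```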
License
Articles in these datasets are made available consistent with either the terms of applicable article-level license statements or the funder’s policy. See PMC Copyright for more information.
Contact
pubmedcentral@ncbi.nlm.nih.gov
How to Cite
See the individual dataset pages on how to cite the PMC Open Access Subset and PMC Author Manuscript Dataset.