author    Jules Laplace <julescarbon@gmail.com>  2018-12-07 18:46:03 +0100
committer Jules Laplace <julescarbon@gmail.com>  2018-12-07 18:46:03 +0100
commit    588c96ab6d38f30bbef3aa773163b36838538355 (patch)
tree      2fd92e67cbe9276de222c26c03b2082fb4ace52a /scraper
parent    9d0c59efe26ac3607900ff1685eafe5572b06400 (diff)

    path
Diffstat (limited to 'scraper'):

 scraper/README.md            | 8
 scraper/s2-extract-papers.py | 2
 2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/scraper/README.md b/scraper/README.md
index 782fa30a..318bba9a 100644
--- a/scraper/README.md
+++ b/scraper/README.md
@@ -36,18 +36,24 @@ The Open Research Corpus (ORC) is produced by the Allen Institute / Semantic Sch
 
 http://labs.semanticscholar.org/corpus/
 
+We do a two-stage fetch process as only about 66% of their papers are in this dataset.
+
 ### s2-search.py
 
 Loads titles from citations file and queries the S2 search API to get paper IDs, then uses the paper IDs from the search entries to query the S2 papers API to get first-degree citations, authors, etc.
 
 ### s2-dump-ids.py
 
-Extract all the paper IDs and citation IDs from the queried papers.
+Dump all the paper IDs and citation IDs from the queried papers.
 
 ### s2-extract-papers.py
 
 Extracts papers from the ORC dataset which have been queried from the API.
 
+### s2-dump-missing-paper-ids.py
+
+Dump the citation IDs that were not found in the ORC dataset.
+
 ### s2-raw-papers.py
 
 Some papers are not in the ORC dataset and must be scraped from S2 directly.
diff --git a/scraper/s2-extract-papers.py b/scraper/s2-extract-papers.py
index bd30c24b..7cbe1244 100644
--- a/scraper/s2-extract-papers.py
+++ b/scraper/s2-extract-papers.py
@@ -5,7 +5,7 @@ import click
 
 from util import *
 
 S2_DIR = '/media/blue/undisclosed/semantic-scholar/corpus-2018-05-03'
-DATA_DIR = '/home/lens/undisclosed/megapixels_dev/datasets/s2/db_papers'
+DATA_DIR = '/home/lens/undisclosed/megapixels_dev/scraper/datasets/s2/db_papers'
 
 @click.command()
 @click.option('--fn', '-f', default='ids.json', help='List of IDs to extract from the big dataset.')
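
The two-stage flow the README changes describe (extract the wanted papers that are present in the ORC dump, then record the IDs that were not found so they can be scraped from S2 directly) can be sketched roughly as below. This is a minimal illustration only: the function name, the `{"id": ...}` record layout, and the in-memory corpus stand-in are assumptions, not taken from the actual scripts.

```python
def extract_papers(corpus_records, wanted_ids):
    """Split a corpus into the wanted records that were found,
    plus the set of wanted IDs missing from the corpus."""
    corpus_ids = {record["id"] for record in corpus_records}
    found = [r for r in corpus_records if r["id"] in wanted_ids]
    missing = wanted_ids - corpus_ids  # these would go to s2-raw-papers.py
    return found, missing

# Demo with in-memory records standing in for ORC corpus entries
# (hypothetical data, not from the real dataset):
corpus = [
    {"id": "a1", "title": "Paper A"},
    {"id": "b2", "title": "Paper B"},
]
found, missing = extract_papers(corpus, {"a1", "c3"})
```

In the real pipeline the first stage would stream the large ORC corpus files from `S2_DIR`, and the second stage (`s2-dump-missing-paper-ids.py`) would write out the `missing` set for the direct-scrape fallback.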