# megapixels dev

## installation

```
conda create -n megapixels python=3.7
conda activate megapixels
pip install urllib3
pip install requests
pip install simplejson
pip install click
pip install pdfminer.six
pip install csvtool
npm install
```

## workflow

```
Paper in spreadsheet -> paper_name
  -> S2 Search API -> paper_id
  -> S2 Paper API -> citations
  -> S2 Dataset -> full records with PDF URLs, authors, more citations
  -> wget -> .pdf files
  -> pdfminer.six -> pdf text
  -> text mining -> named entities (organizations)
  -> Geocoding service -> lat/lngs
```

To begin, export `datasets/citations.csv` from the Google doc.

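For reference, this is roughly how the downstream scripts can read the export - a minimal sketch, assuming the export has a header row with a `title` column (the real column names come from the Google doc and may differ):

```
import csv

# Read paper titles from the exported spreadsheet.
# The `title` column name is an assumption; adjust it to match the export.
with open('datasets/citations.csv', newline='', encoding='utf-8') as f:
    titles = [row['title'] for row in csv.DictReader(f) if row.get('title')]

print(f'{len(titles)} paper titles loaded')
```
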
---

## Extracting data from S2 / ORC

The Open Research Corpus (ORC) is produced by the Allen Institute / Semantic Scholar (S2) / arXiv people. It may be downloaded here:

http://labs.semanticscholar.org/corpus/

### s2-search.py

Loads titles from the citations file and queries the S2 search API to get paper IDs.

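A minimal sketch of the kind of lookup involved, using the current public Semantic Scholar Graph API search endpoint (an assumption - the script itself may target an older S2 search endpoint):

```
import requests

def search_paper_id(title):
    # Query the public Semantic Scholar search endpoint for a title
    # and return the best-matching paper ID (None if nothing matched).
    r = requests.get(
        'https://api.semanticscholar.org/graph/v1/paper/search',
        params={'query': title, 'fields': 'title', 'limit': 1},
        timeout=30,
    )
    r.raise_for_status()
    data = r.json().get('data', [])
    return data[0]['paperId'] if data else None

print(search_paper_id('ImageNet: A Large-Scale Hierarchical Image Database'))
```
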
### s2-papers.py

Uses the paper IDs from the search entries to query the S2 papers API to get first-degree citations, authors, etc.

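Roughly, the per-paper lookup looks like this against the current Graph API (again an assumption - the original script likely used the older `api.semanticscholar.org/v1/paper/` endpoint, which also returns citations and authors):

```
import requests

def fetch_paper(paper_id):
    # Fetch title, authors, and first-degree citations for one paper ID.
    r = requests.get(
        f'https://api.semanticscholar.org/graph/v1/paper/{paper_id}',
        params={'fields': 'title,authors,citations.title,citations.paperId'},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()

paper = fetch_paper('arXiv:1409.0575')  # IDs from s2-search.py also work
print(paper['title'], len(paper.get('citations', [])))
```
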
### s2-dump-ids.py

Extracts all the paper IDs and citation IDs from the queried papers.

### s2-extract-papers.py

Extracts the papers that were queried from the API out of the ORC dataset.

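The ORC release is distributed as gzipped JSON-lines files, one paper record per line. A sketch of the filtering step, assuming an `id` field on each record and a `paper_ids.txt` file produced by s2-dump-ids.py (the filenames and shard pattern are assumptions):

```
import glob
import gzip
import json

# IDs produced by s2-dump-ids.py -- the filename is an assumption.
with open('data/paper_ids.txt') as f:
    wanted = set(line.strip() for line in f if line.strip())

# Scan every corpus shard and keep only the records whose id we queried.
with open('data/orc_matches.jsonl', 'w') as out:
    for shard in glob.glob('corpus/s2-corpus-*.gz'):
        with gzip.open(shard, 'rt', encoding='utf-8') as f:
            for line in f:
                record = json.loads(line)
                if record.get('id') in wanted:
                    out.write(line)
```
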
### s2-raw-papers.py

Some papers are not in the ORC dataset and must be scraped from S2 directly.

---

## Extracting data from Google Scholar

Included in the content-script folder is a Chrome extension which scrapes Google Scholar through the browser, clicking the links and extracting PDF links, the number of citations, etc., then saving a JSON file when it's done. It still requires work to process the output (cross-referencing with S2 and dumping the PDFs).

---

## Scraping Institutions

Once you have the data from S2, you can scrape all the PDFs (and other URLs) you find, then extract institutions from those and geocode them.

### s2-dump-pdf-urls.py

Dumps PDF URLs (and also IEEE URLs, etc.) to CSV files.

### s2-fetch-pdf.py

Fetches the PDFs.

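A minimal sketch of the fetch step, assuming one URL per line in the dumped list (the filenames are assumptions):

```
import hashlib
import os
import requests

os.makedirs('pdfs', exist_ok=True)

# Output of s2-dump-pdf-urls.py -- the path is an assumption.
with open('reports/pdf_urls.csv') as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    # Name each file after a hash of its URL so re-runs can skip downloads.
    path = os.path.join('pdfs', hashlib.sha1(url.encode()).hexdigest() + '.pdf')
    if os.path.exists(path):
        continue
    try:
        r = requests.get(url, timeout=60)
        r.raise_for_status()
        with open(path, 'wb') as out:
            out.write(r.content)
    except requests.RequestException as e:
        print(f'failed: {url} ({e})')
```
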
### s2-fetch-doi.py

Fetches the files listed in `ieee.json` and processes them.

### pdf_dump_first_page.sh

Uses pdfminer.six to extract the first page of each PDF.

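The same extraction can be done from Python with pdfminer.six's high-level API; a sketch, with assumed paths:

```
import glob
import os
from pdfminer.high_level import extract_text

os.makedirs('first_pages', exist_ok=True)

for pdf_path in glob.glob('pdfs/*.pdf'):
    out_path = os.path.join('first_pages', os.path.basename(pdf_path) + '.txt')
    try:
        # page_numbers is zero-indexed: [0] extracts only the first page.
        text = extract_text(pdf_path, page_numbers=[0])
    except Exception as e:
        print(f'failed: {pdf_path} ({e})')
        continue
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write(text)
```
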
### s2-pdf-first-pages.py

Performs an initial extraction of university-like terms, to be geocoded.

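To give a rough idea of what "university-like terms" means in practice - the patterns below are illustrative assumptions, not the script's actual rules:

```
import re

# Illustrative patterns for institution-like phrases on a paper's first page.
INSTITUTION_RE = re.compile(
    r'((?:[A-Z][\w&.-]*\s+){0,4}'
    r'(?:University|Institute|Laboratory|Laboratories|College|Academy)'
    r'(?:\s+of\s+(?:[A-Z][\w&.-]*\s*){1,4})?)'
)

def extract_institutions(first_page_text):
    # Return deduplicated institution-like phrases, preserving order.
    seen, found = set(), []
    for match in INSTITUTION_RE.findall(first_page_text):
        term = ' '.join(match.split())
        if term.lower() not in seen:
            seen.add(term.lower())
            found.append(term)
    return found

print(extract_institutions('1 Stanford University  2 Max Planck Institute of Informatics'))
```
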
### s2-doi-report.py

Extracts named entities from the scraped DOI links (IEEE, ACM, etc.).

### s2-geocode.py

Geocodes lists of entities using Nominatim.

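A minimal sketch of a Nominatim lookup (the public instance asks for a descriptive User-Agent and at most one request per second):

```
import time
import requests

def geocode(entity):
    # Look up one institution name and return (lat, lon), or None if no hit.
    r = requests.get(
        'https://nominatim.openstreetmap.org/search',
        params={'q': entity, 'format': 'json', 'limit': 1},
        headers={'User-Agent': 'megapixels-scraper (research use)'},
        timeout=30,
    )
    r.raise_for_status()
    results = r.json()
    if not results:
        return None
    return float(results[0]['lat']), float(results[0]['lon'])

for name in ['Carnegie Mellon University', 'University of Oxford']:
    print(name, geocode(name))
    time.sleep(1)  # respect the public instance's rate limit
```
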
### s2-citation-report.py

For each paper in the citations CSV, finds the corresponding paper in the database and gets all of its citations.
For each citation, tries to find an address, embeds the matching entries from the institutions list, and renders them on a Leaflet map.

---

## Cleaning the Data

After scraping these universities, we got up to a 47% match rate on papers from the dataset. However, there is still more to solve:

- Fix the geocoding - this must be done manually - we will dedupe the entries in the entities table, then extract specific entities from the dataset.
- Unknown addresses - we have addresses for some places, but we need to either a) geocode them again or b) fall back to geocoding just the city.
- Match entries that span multiple lines of the extracted text.
- Empty addresses - some papers may need to be gone through by hand. Maybe we can do bigram/trigram analysis on the headings (see the sketch after this list); just finding common words would help.
- Make a list of bogus papers - ones where pdfminer returned empty results, which did not contain the word ABSTRACT, or which were too long.

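A sketch of the bigram idea from the list above: count adjacent word pairs across the extracted first pages (or just their headings) and look for recurring institutional phrases (the file layout is assumed):

```
import glob
from collections import Counter

bigrams = Counter()
for path in glob.glob('first_pages/*.txt'):
    with open(path, encoding='utf-8') as f:
        words = f.read().split()
    bigrams.update(zip(words, words[1:]))

# The most frequent pairs should surface recurring phrases such as
# ('University', 'of') or ('Institute', 'of').
for pair, count in bigrams.most_common(20):
    print(count, ' '.join(pair))
```
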
### expand-uni-lookup.py

By now I had a list of institutions in `reports/all_institutions.csv` (built by merging the results of the geocoding, as I had run it on 4 computers and thus had 4 files of institutions). This file must be gone through manually. This technique geocoded around 47% of the papers.

At this point I moved `reports/all_institutions.csv` into Google Sheets. All further results use the CSV on Google Sheets.

### s2-pdf-report.py

Generates reports on the things that were not found in the PDFs.

### s2-geocode-spreadsheet.py

To add new institutions, simply list them in the spreadsheet with the lat/lng fields empty. Then run this script, and anything missing a lat/lng will get one.

### s2-citation-report.py

Generates the main report with maps and citation lists.

---


## Useful scripts for batch processing

### split-csv.py

Shuffles a CSV and splits it into multiple files.

### merge-csv.py

Merges a folder of CSVs into a single file, deduping on the first column.
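
For reference, the dedupe-on-first-column logic amounts to something like this (paths are assumptions):

```
import csv
import glob

seen = set()
with open('merged.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    for path in sorted(glob.glob('csv_parts/*.csv')):
        with open(path, newline='', encoding='utf-8') as f:
            for row in csv.reader(f):
                # Skip empty rows and rows whose first column we already kept.
                if not row or row[0] in seen:
                    continue
                seen.add(row[0])
                writer.writerow(row)
```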