| field | value | date |
|---|---|---|
| author | Jules Laplace <julescarbon@gmail.com> | 2018-11-25 22:19:15 +0100 |
| committer | Jules Laplace <julescarbon@gmail.com> | 2018-11-25 22:19:15 +0100 |
| commit | ee3d0d98e19f1d8177d85af1866fd0ee431fe9ea (patch) | |
| tree | 41372528e78d4328bc2a47bbbabac7e809c58894 /scraper/README.md | |
| parent | 255b8178af1e25a71fd23703d30c0d1f74911f47 (diff) | |
moving stuff
Diffstat (limited to 'scraper/README.md')
| -rw-r--r-- | scraper/README.md | 144 |
1 file changed, 144 insertions, 0 deletions
diff --git a/scraper/README.md b/scraper/README.md
new file mode 100644
index 00000000..964a3ee3
--- /dev/null
+++ b/scraper/README.md
@@ -0,0 +1,144 @@

# megapixels dev

## installation

```
conda create -n megapixels python=3.7
conda activate megapixels
pip install urllib3
pip install requests
pip install simplejson
pip install click
pip install pdfminer.six
pip install csvtool
npm install
```

## workflow

```
Paper in spreadsheet -> paper_name
  -> S2 Search API -> paper_id
  -> S2 Paper API -> citations
  -> S2 Dataset -> full records with PDF URLs, authors, more citations
  -> wget -> .pdf files
  -> pdfminer.six -> pdf text
  -> text mining -> named entities (organizations)
  -> Geocoding service -> lat/lngs
```

To begin, export `datasets/citations.csv` from the Google doc.

---

## Extracting data from S2 / ORC

The Open Research Corpus (ORC) is produced by the Allen Institute / Semantic Scholar (S2) / arXiv people. It can be downloaded here:

http://labs.semanticscholar.org/corpus/

### s2-search.py

Loads titles from the citations file and queries the S2 search API to get paper IDs.

### s2-papers.py

Uses the paper IDs from the search results to query the S2 papers API for first-degree citations, authors, etc.

### s2-dump-ids.py

Extracts all the paper IDs and citation IDs from the queried papers.

### s2-extract-papers.py

Extracts papers from the ORC dataset which have been queried from the API.

### s2-raw-papers.py

Some papers are not in the ORC dataset and must be scraped from S2 directly.

---

## Extracting data from Google Scholar

The content-script folder contains a Chrome extension that scrapes Google Scholar through the browser, clicking the links and extracting PDF links, citation counts, etc., then saving a JSON file when it's done. It still requires work to process the output (cross-reference with S2 and dump the PDFs).

---

## Scraping Institutions

Once you have the data from S2, you can scrape all the PDFs (and other URLs) you find, then extract institutions from them and geocode those.

### s2-dump-pdf-urls.py

Dumps PDF URLs (and also IEEE URLs, etc.) to CSV files.

### s2-fetch-pdf.py

Fetches the PDFs.

### s2-fetch-doi.py

Fetches the files listed in ieee.json and processes them.

### pdf_dump_first_page.sh

Uses pdfminer.six to extract the first page from the PDFs.

### s2-pdf-first-pages.py

Performs the initial extraction of university-like terms, to be geocoded.

### s2-doi-report.py

Extracts named entities from the scraped DOI links (IEEE, ACM, etc.).

### s2-geocode.py

Geocodes lists of entities using Nominatim.
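For reference, the lookup is just a GET against Nominatim's public `/search` endpoint. Below is a minimal sketch of that step; the file paths, single-column input layout, and user agent string are placeholders for illustration, not the actual interface of `s2-geocode.py`.

```
# Illustrative sketch only -- file paths and CSV layout are assumptions,
# not the actual interface of s2-geocode.py.
import csv
import time
import requests

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"
HEADERS = {"User-Agent": "megapixels-scraper"}  # Nominatim's usage policy requires a UA

def geocode(name):
    """Return (lat, lon) for an entity name, or None if Nominatim finds nothing."""
    resp = requests.get(
        NOMINATIM_URL,
        params={"q": name, "format": "json", "limit": 1},
        headers=HEADERS,
    )
    resp.raise_for_status()
    results = resp.json()
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])

with open("reports/entities.csv", newline="") as f, \
     open("reports/geocoded.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in csv.reader(f):
        if not row:
            continue
        latlng = geocode(row[0])
        if latlng:
            writer.writerow([row[0], latlng[0], latlng[1]])
        time.sleep(1)  # stay under Nominatim's 1 request/second limit
```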
### s2-citation-report.py

For each paper in the citations CSV, finds the corresponding paper in the database and gets all of its citations. For each of those citations, it tries to find an address, embeds the appropriate entries from the institutions list, and then renders them on a Leaflet map.

---

## Cleaning the Data

After scraping these universities, we got up to a 47% match rate on papers from the dataset. However, there is still more to solve:

- Fix the geocoding - this must be done manually - we will dedupe the entries in the entities table, then extract specific entities from the dataset.
- Unknown addresses - we have addresses for some places, but we need to either a) geocode them again or b) fall back to geocoding just the city.
- Match entities that span multiple lines.
- Empty addresses - some papers may need to be gone through by hand. Maybe we can do bigram/trigram analysis on the headings. Just finding common words would help.
- Make a list of bogus papers - ones where pdfminer returned empty results, which did not contain the word ABSTRACT, or which were too long.

### expand-uni-lookup.py

By now I had a list of institutions in `reports/all_institutions.csv` (built by merging the geocoding results, which I had run on 4 computers and thus had 4 institution files). This file must be gone through manually. This technique geocoded around 47% of papers.

At this point I moved `reports/all_institutions.csv` into Google Sheets. All further results use the CSV on Google Sheets.

### s2-pdf-report.py

Generates reports of things from the PDFs that were not found.

### s2-geocode-spreadsheet.py

To add new institutions, simply list them in the spreadsheet with the lat/lng fields empty. Then run this script, and anything missing a lat/lng will get one.

### s2-citation-report.py

Generates the main report with maps and citation lists.

---

## Useful scripts for batch processing

### split-csv.py

Shuffles and splits a CSV into multiple files.

### merge-csv.py

Merges a folder of CSVs into a single file, deduping based on the first column.
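A minimal sketch of what this merge-and-dedupe step looks like, assuming headerless CSVs keyed on the first column; the folder and output paths are placeholders, not the script's real arguments.

```
# Illustrative sketch only -- the real merge-csv.py may take different arguments.
import csv
import glob

seen = set()   # first-column keys already written
merged = []

# Assumed layout: a folder of headerless CSVs, deduped on their first column.
for path in sorted(glob.glob("reports/institutions/*.csv")):
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row or row[0] in seen:
                continue  # skip blank rows and duplicate keys
            seen.add(row[0])
            merged.append(row)

with open("reports/all_institutions.csv", "w", newline="") as out:
    csv.writer(out).writerows(merged)
```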
