# megapixels dev

## installation

```
conda create -n megapixels python=3.7
conda activate megapixels
pip install urllib3
pip install requests
pip install simplejson
pip install click
pip install pdfminer.six
pip install csvtool
npm install
```

## workflow

```
Paper in spreadsheet -> paper_name
  -> S2 Search API -> paper_id
  -> S2 Paper API -> citations
  -> S2 Dataset -> full records with PDF URLs, authors, more citations
  -> wget -> .pdf files
  -> pdfminer.six -> pdf text
  -> text mining -> named entities (organizations)
  -> Geocoding service -> lat/lngs
```

To begin, export `datasets/citations.csv` from the Google doc.
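
The exported file can then be read with a few lines of Python. A minimal sketch, assuming the export lives at `datasets/citations.csv` and has a `paper_name` column (the column name is an assumption taken from the workflow diagram above):

```
# load_citations.py - minimal sketch; the "paper_name" column name is assumed.
import csv

def load_paper_names(path="datasets/citations.csv"):
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Skip rows without a title so downstream API queries stay clean.
        return [row["paper_name"].strip() for row in reader if row.get("paper_name")]

if __name__ == "__main__":
    names = load_paper_names()
    print(f"loaded {len(names)} paper titles")
```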

---

## Extracting data from S2 / ORC

The Open Research Corpus (ORC) is produced by the people behind the Allen Institute for AI, Semantic Scholar (S2), and arXiv. It can be downloaded here:

http://labs.semanticscholar.org/corpus/

### s2-search.py

Loads titles from the citations file and queries the S2 search API to get paper IDs.
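
Roughly, that lookup might look like the sketch below. It uses the current public Semantic Scholar Graph API search endpoint; the actual script may target an older S2 endpoint, and the field list is an assumption.

```
# s2_search_sketch.py - look up a paper ID by title via the S2 Graph API.
# The /graph/v1/paper/search endpoint is the current public API; the real
# s2-search.py may use an older endpoint.
import time
import requests

SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def search_paper_id(title):
    resp = requests.get(SEARCH_URL, params={"query": title, "fields": "title", "limit": 1})
    resp.raise_for_status()
    results = resp.json().get("data", [])
    # Take the top hit; a real run would want to sanity-check the title match.
    return results[0]["paperId"] if results else None

if __name__ == "__main__":
    print(search_paper_id("Labeled Faces in the Wild"))
    time.sleep(1)  # stay polite with the public rate limit
```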

### s2-papers.py

Uses the paper IDs from the search results to query the S2 papers API for first-degree citations, authors, etc.
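
A sketch of that query, again against the current Graph API paper endpoint rather than whatever endpoint the script actually uses; the requested fields are assumptions.

```
# s2_paper_sketch.py - fetch citations and authors for one paper ID via the
# S2 Graph API; the real s2-papers.py may request different fields.
import requests

PAPER_URL = "https://api.semanticscholar.org/graph/v1/paper/{}"

def fetch_paper(paper_id):
    fields = "title,authors,citations.paperId,citations.title"
    resp = requests.get(PAPER_URL.format(paper_id), params={"fields": fields})
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    paper = fetch_paper("ARXIV:1411.7766")  # an arXiv-prefixed ID as an example
    print(paper["title"], len(paper.get("citations", [])))
```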

### s2-dump-ids.py

Extracts all the paper IDs and citation IDs from the queried papers.
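
Something like the following sketch, assuming the responses from the previous step were saved as JSON files under a hypothetical `data/papers/` directory:

```
# dump_ids_sketch.py - collect paper IDs plus their citation IDs into one
# de-duplicated set. The data/papers/*.json layout is an assumption.
import glob
import json

def collect_ids(pattern="data/papers/*.json"):
    ids = set()
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            paper = json.load(f)
        ids.add(paper["paperId"])
        ids.update(c["paperId"] for c in paper.get("citations", []) if c.get("paperId"))
    return ids

if __name__ == "__main__":
    for paper_id in sorted(collect_ids()):
        print(paper_id)
```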

### s2-extract-papers.py

Extracts from the ORC dataset the papers that have already been queried from the API.
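
The ORC release ships as gzipped JSON-lines files, so this step presumably streams them and keeps only the wanted IDs. In the sketch below the `corpus/*.gz` layout and the `id` field name are assumptions about that format:

```
# extract_orc_sketch.py - stream the gzipped JSON-lines corpus files and keep
# only records whose id is in the wanted set. The "id" field name and the
# corpus/*.gz layout are assumptions about the ORC release.
import glob
import gzip
import json

def extract_papers(wanted_ids, pattern="corpus/*.gz", out_path="data/orc_papers.jsonl"):
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in glob.glob(pattern):
            with gzip.open(path, "rt", encoding="utf-8") as f:
                for line in f:
                    record = json.loads(line)
                    if record.get("id") in wanted_ids:
                        out.write(line)
                        kept += 1
    return kept
```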

### s2-raw-papers.py

Some papers are not in the ORC dataset and must be scraped from S2 directly.

---

## Extracting data from Google Scholar

The `content-script` folder contains a Chrome extension that scrapes Google Scholar through the browser: it clicks through the result links, extracts PDF links, citation counts, etc., then saves a JSON file when it is done. The output still needs further work to process (cross-referencing with S2 and dumping the PDFs).

---

## Scraping Institutions

Once you have the data from S2, you can scrape all the PDFs (and other URLs) you find, then extract institutions from them and geocode them.

### s2-dump-pdf-urls.py

Dumps PDF URLs (and also IEEE URLs, etc.) to CSV files.
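
A sketch of the routing this implies; the input path and the `pdfUrls` field name are assumptions carried over from the ORC sketch above:

```
# dump_pdf_urls_sketch.py - route URLs from the extracted records into
# pdf.csv vs ieee.csv. The input file and "pdfUrls" field name are assumptions.
import csv
import json

def dump_urls(in_path="data/orc_papers.jsonl"):
    with open(in_path, encoding="utf-8") as f, \
         open("data/pdf.csv", "w", newline="") as pdf_out, \
         open("data/ieee.csv", "w", newline="") as ieee_out:
        pdf_writer, ieee_writer = csv.writer(pdf_out), csv.writer(ieee_out)
        for line in f:
            record = json.loads(line)
            for url in record.get("pdfUrls", []):
                # Direct .pdf links go to pdf.csv, everything else to ieee.csv.
                target = pdf_writer if url.lower().endswith(".pdf") else ieee_writer
                target.writerow([record.get("id"), url])
```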

### s2-fetch-pdf.py

Fetches the PDFs.
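
A minimal fetch loop might look like this sketch, which skips files that already exist and tolerates dead links; the `data/pdf.csv` and `data/pdfs/` paths are assumptions:

```
# fetch_pdfs_sketch.py - download each PDF listed in pdf.csv, skipping files
# that already exist. Paths mirror the assumptions in the previous sketch.
import csv
import os
import time
import requests

def fetch_pdfs(csv_path="data/pdf.csv", out_dir="data/pdfs"):
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path, newline="") as f:
        for paper_id, url in csv.reader(f):
            dest = os.path.join(out_dir, f"{paper_id}.pdf")
            if os.path.exists(dest):
                continue  # already downloaded
            try:
                resp = requests.get(url, timeout=30)
                resp.raise_for_status()
            except requests.RequestException:
                continue  # dead links are common; skip and move on
            with open(dest, "wb") as out:
                out.write(resp.content)
            time.sleep(0.5)  # be gentle with the hosts
```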

### s2-fetch-doi.py

Fetches the files listed in ieee.json and processes them.

### pdf_dump_first_page.sh

Uses pdfminer.six to extract the first page of each PDF.
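
The shell script presumably wraps pdfminer's command-line tool; the same result can be sketched with pdfminer.six's Python API:

```
# first_page_sketch.py - extract the text of page 1 of each PDF with
# pdfminer.six (the .sh script presumably shells out to pdf2txt.py instead).
import glob
import os
from pdfminer.high_level import extract_text

def dump_first_pages(pdf_dir="data/pdfs", out_dir="data/first_pages"):
    os.makedirs(out_dir, exist_ok=True)
    for pdf_path in glob.glob(os.path.join(pdf_dir, "*.pdf")):
        out_path = os.path.join(out_dir, os.path.basename(pdf_path) + ".txt")
        try:
            text = extract_text(pdf_path, maxpages=1)  # first page only
        except Exception:
            continue  # broken PDFs happen; these become "bogus papers" later
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(text)
```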

### s2-pdf-first-pages.py

Performs an initial extraction of university-like terms, to be geocoded.
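
One plausible way to do this is a keyword pass over the first-page text; the keyword list below is an assumption, not the script's actual rules:

```
# uni_terms_sketch.py - pull university-like lines out of first-page text.
# The keyword list is an assumption; the real script's rules may differ.
import re

INSTITUTION_RE = re.compile(
    r"\b(university|institute|laborator(?:y|ies)|college|academy)\b", re.IGNORECASE
)

def find_institution_lines(text):
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and INSTITUTION_RE.search(line)]

if __name__ == "__main__":
    sample = "Deep Faces\nJane Doe\nDept. of CS, Example University\nAbstract\n..."
    print(find_institution_lines(sample))  # ['Dept. of CS, Example University']
```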

### s2-doi-report.py

Extracts named entities from the scraped DOI links (IEEE, ACM, etc.).

### s2-geocode.py

Geocodes lists of entities using Nominatim.
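
A minimal sketch of a single lookup against Nominatim's public search endpoint; note the public instance wants a descriptive User-Agent and no more than about one request per second:

```
# geocode_sketch.py - look up lat/lng for an institution name via Nominatim.
# The public instance requires a User-Agent and ~1 request/second.
import time
import requests

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"
HEADERS = {"User-Agent": "megapixels-dev-geocoder (research use)"}

def geocode(name):
    params = {"q": name, "format": "json", "limit": 1}
    resp = requests.get(NOMINATIM_URL, params=params, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    results = resp.json()
    time.sleep(1)  # respect the usage policy
    return (float(results[0]["lat"]), float(results[0]["lon"])) if results else None

if __name__ == "__main__":
    print(geocode("Carnegie Mellon University"))
```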

### s2-citation-report.py

For each paper in the citations CSV, finds the corresponding paper in the database and collects all of its citations.
Then tries to find an address for each citation, embeds the matching entries from the institutions list, and renders them on a Leaflet map.

---

## Cleaning the Data

After scraping these universities, we reached a match rate of up to 47% on papers from the dataset. However, there is still more to solve:

- Fix the geocoding - this must be done manually; we will dedupe the entries in the entities table, then extract specific entities from the dataset.
- Unknown addresses - we have addresses for some places but we need to either a) geocode them again or b) fall back to geocoding just the city.
- Match entities that span multiple lines.
- Empty addresses - some papers may need to be gone through by hand. Maybe we can do bigram/trigram analysis on the headings; just finding common words would help.
- Make a list of bogus papers - ones where pdfminer returned empty results, which did not contain the word ABSTRACT, or which were too long.

### expand-uni-lookup.py

By this point I had a list of institutions in `reports/all_institutions.csv`, built by merging the geocoding results (the geocoding was run on 4 computers, so there were 4 institution files to merge). This file must be gone through manually. This technique geocoded around 47% of the papers.

At this point I moved `reports/all_institutions.csv` into Google Sheets. All further results use the CSV on Google Sheets.

### s2-pdf-report.py

Generates reports of things from the PDFs that were not found.

### s2-geocode-spreadsheet.py

To add new institutions, simply list them in the spreadsheet with the lat/lng fields empty. Then run this script and anything missing a lat/lng will get one.
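
A sketch of that fill-in pass; the column names (`institution`, `lat`, `lng`) are assumptions about the spreadsheet export, and `geocode()` stands in for the Nominatim helper sketched under `s2-geocode.py`:

```
# geocode_spreadsheet_sketch.py - fill in rows whose lat/lng cells are empty.
# Column names "institution", "lat", "lng" are assumptions about the export.
import csv

from geocode_sketch import geocode  # hypothetical module: the Nominatim sketch above

def fill_missing(in_path="institutions.csv", out_path="institutions_filled.csv"):
    with open(in_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    fieldnames = list(rows[0].keys()) if rows else ["institution", "lat", "lng"]
    for row in rows:
        if not row.get("lat") or not row.get("lng"):
            result = geocode(row["institution"])
            if result:
                row["lat"], row["lng"] = result
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```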

### s2-citation-report.py

Generates the main report with maps and citation lists.

---

## Useful scripts for batch processing

### split-csv.py

Shuffles a CSV and splits it into multiple files.
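
For example, something along these lines (useful for spreading the geocoding across several machines):

```
# split_csv_sketch.py - shuffle a CSV's rows and split them into N parts.
# Paths and the number of parts are placeholders.
import csv
import random

def split_csv(in_path, n_parts=4, seed=0):
    with open(in_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header, rows = next(reader), list(reader)
    random.Random(seed).shuffle(rows)
    for i in range(n_parts):
        with open(f"{in_path}.part{i}.csv", "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows[i::n_parts])
```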

### merge-csv.py

Merges a folder of CSVs into a single file, deduping based on the first column.
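
Roughly, assuming all files share a header:

```
# merge_csv_sketch.py - merge every CSV in a folder into one file, keeping the
# first row seen for each value of the first column.
import csv
import glob
import os

def merge_csvs(folder, out_path="merged.csv"):
    seen, header, merged = set(), None, []
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            file_header = next(reader, None)
            header = header or file_header
            for row in reader:
                if row and row[0] not in seen:
                    seen.add(row[0])
                    merged.append(row)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if header:
            writer.writerow(header)
        writer.writerows(merged)
```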