# megapixels dev

## installation

```
conda create -n megapixels python=3.7
pip install urllib3
pip install requests
pip install simplejson
pip install click
pip install pdfminer.six
pip install csvtool
npm install
```

## workflow

```
Paper in spreadsheet -> paper_name
  -> S2 Search API -> paper_id
  -> S2 Paper API -> citations
  -> S2 Dataset -> full records with PDF URLs, authors, more citations
  -> wget -> .pdf files
  -> pdfminer.six -> pdf text
  -> text mining -> named entities (organizations)
  -> Geocoding service -> lat/lngs
```

To begin, export `datasets/citations.csv` from the Google doc.

---

## Extracting data from S2 / ORC

The Open Research Corpus (ORC) is produced by the Allen Institute / Semantic Scholar (S2) / arXiv people. It can be downloaded here:

http://labs.semanticscholar.org/corpus/

### s2-search.py

Loads titles from the citations file and queries the S2 search API to get paper IDs.

### s2-papers.py

Uses the paper IDs from the search results to query the S2 papers API for first-degree citations, authors, etc.
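For reference, the request pattern behind these two scripts is a plain JSON lookup against the public S2 API. Below is a minimal sketch, assuming the `https://api.semanticscholar.org/v1/paper/<id>` endpoint and its `citations` and `authors` fields; the actual scripts may use a different endpoint, parameters, or rate limiting.

```
# Hedged sketch of the S2 lookup step (not the actual s2-search.py / s2-papers.py code).
# Assumes the public S2 paper endpoint https://api.semanticscholar.org/v1/paper/<id>;
# the field names used here ("title", "citations", "paperId") follow that endpoint's
# JSON and may differ from what the real scripts consume.
import requests


def fetch_paper(paper_id):
    """Fetch one paper record (title, authors, first-degree citations) from S2."""
    url = f"https://api.semanticscholar.org/v1/paper/{paper_id}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    # arXiv-style ID used purely as an example
    paper = fetch_paper("arXiv:1512.03385")
    print(paper.get("title"))
    for citation in paper.get("citations", [])[:5]:
        print("  citation:", citation.get("paperId"), citation.get("title"))
```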
### s2-dump-ids.py

Extracts all the paper IDs and citation IDs from the queried papers.

### s2-extract-papers.py

Extracts papers from the ORC dataset which have been queried from the API.

### s2-raw-papers.py

Some papers are not in the ORC dataset and must be scraped from S2 directly.

---

## Extracting data from Google Scholar

The content-script folder contains a Chrome extension which scrapes Google Scholar through the browser: it clicks through the result links, extracts PDF links, citation counts, etc., and saves a JSON file when it's done. The output still needs further processing (cross-referencing with S2 and dumping the PDFs).

---

## Scraping Institutions

Once you have the data from S2, you can scrape all the PDFs (and other URLs) you find, then extract institutions from them and geocode those.

### s2-dump-pdf-urls.py

Dumps PDF URLs (and also IEEE URLs, etc.) to CSV files.

### s2-fetch-pdf.py

Fetches the PDFs.

### s2-fetch-doi.py

Fetches the files listed in ieee.json and processes them.

### pdf_dump_first_page.sh

Uses pdfminer.six to extract the first page from each PDF.

### s2-pdf-first-pages.py

Performs the initial extraction of university-like terms, to be geocoded.

### s2-doi-report.py

Extracts named entities from the scraped DOI links (IEEE, ACM, etc.).

### s2-geocode.py

Geocodes lists of entities using Nominatim.

### s2-citation-report.py

For each paper in the citations CSV, finds the corresponding paper in the database and gets all of its citations. For each citation, it tries to find an address, embeds the matching entries from the institutions list, and renders them on a Leaflet map.

---

## Cleaning the Data

After scraping these universities, we got up to a 47% match rate on papers from the dataset. However, there is still more to solve:

- Fix the geocoding - this must be done manually: dedupe the entries in the entities table, then extract specific entities from the dataset.
- Unknown addresses - we have addresses for some places, but we need to a) geocode them again, or b) geocode just the city.
- Match across multiple lines.
- Empty addresses - some papers may need to be gone through by hand. Maybe we can do digram/trigram analysis on the headings; just finding common words would help.
- Make a list of bogus papers - ones where pdfminer returned empty results, which did not contain the word ABSTRACT, or which were too long.

### expand-uni-lookup.py

By this point I had a list of institutions in `reports/all_institutions.csv`, produced by merging the geocoding results (the geocoding was run on 4 computers, so there were 4 institution files). This file must be gone through manually. This technique geocoded around 47% of the papers.

At this point I moved `reports/all_institutions.csv` into Google Sheets. All further results use the CSV on Google Sheets.

### s2-pdf-report.py

Generates reports of items from the PDFs that were not found.

### s2-geocode-spreadsheet.py

To add new institutions, simply list them in the spreadsheet with the lat/lng fields empty. Then run this script, and anything missing a lat/lng will get one.

### s2-citation-report.py

Generates the main report with maps and citation lists.

---

## Useful scripts for batch processing

### split-csv.py

Shuffles and splits a CSV into multiple files.

### merge-csv.py

Merges a folder of CSVs into a single file, deduping based on the first column.
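The dedupe rule is simply "keep the first row seen for each value in the first column". A minimal sketch of that idea follows; the real merge-csv.py may handle headers, encodings, and ordering differently.

```
# Hedged sketch of the merge/dedupe idea behind merge-csv.py (not the script itself).
# Walks a folder of CSVs and keeps the first row seen for each first-column value.
import csv
import glob
import sys


def merge_csvs(folder, out_path):
    seen = set()
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for path in sorted(glob.glob(f"{folder}/*.csv")):
            with open(path, newline="") as in_file:
                for row in csv.reader(in_file):
                    if not row or row[0] in seen:
                        continue  # skip blank rows and duplicate keys
                    seen.add(row[0])
                    writer.writerow(row)


if __name__ == "__main__":
    # usage: python merge-csv.py <input_folder> <output.csv>
    merge_csvs(sys.argv[1], sys.argv[2])
```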
