# megapixels dev

## installation

```
conda create -n megapixels python=3.6
conda activate megapixels
pip install urllib3
pip install requests
pip install simplejson
pip install click
pip install pdfminer.six
pip install csvtool
npm install
```

## workflow

```
Paper in spreadsheet -> paper_name
  -> S2 Search API -> paper_id
  -> S2 Paper API -> citations
  -> S2 Dataset -> full records with PDF URLs, authors, more citations
  -> wget -> .pdf files
  -> pdfminer.six -> pdf text
  -> text mining -> named entities (organizations)
  -> Geocoding service -> lat/lngs
```

To begin, export `datasets/citations.csv` from the Google doc.

---

## Extracting data from S2 / ORC

The Open Research Corpus (ORC) is produced by the Semantic Scholar (S2) team at the Allen Institute for AI. It can be downloaded here:

http://labs.semanticscholar.org/corpus/

### s2-search.py

Loads titles from the citations file and queries the S2 search API to get paper IDs, then uses those paper IDs to query the S2 papers API for first-degree citations, authors, etc.
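
The exact endpoints the script hits aren't documented here; as a rough sketch (search half omitted), the public S2 paper endpoint returns title, authors, and first-degree citations for a known paper ID:

```
import requests

# Public S2 paper endpoint; the script may use different endpoints internally.
S2_PAPER_API = "https://api.semanticscholar.org/v1/paper/{}"

def fetch_paper(paper_id):
    """Fetch one paper record (title, authors, citations) from the public S2 paper API."""
    resp = requests.get(S2_PAPER_API.format(paper_id), timeout=30)
    resp.raise_for_status()
    return resp.json()

# Placeholder example ID; a real run would use IDs returned by the search step.
paper = fetch_paper("arXiv:1512.03385")
print(paper["title"], len(paper.get("citations", [])))
```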

### s2-dump-ids.py

Extract all the paper IDs and citation IDs from the queried papers.
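
A minimal sketch of this step, assuming the API responses were saved one JSON file per paper under a hypothetical `data/s2-papers/` folder:

```
import glob
import json

ids = set()
for path in glob.glob("data/s2-papers/*.json"):  # hypothetical output folder of s2-search.py
    with open(path) as f:
        paper = json.load(f)
    ids.add(paper["paperId"])
    # Collect the IDs of every first-degree citation as well.
    for cite in paper.get("citations", []):
        if cite.get("paperId"):
            ids.add(cite["paperId"])

with open("data/paper_ids.txt", "w") as out:
    out.write("\n".join(sorted(ids)))
```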

### s2-extract-papers.py

Extracts papers from the ORC dataset which have been queried from the API.
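
The ORC download ships as gzipped JSON-lines files (one record per line); a sketch of pulling out just the queried IDs, with hypothetical paths, might look like:

```
import glob
import gzip
import json

# IDs produced by s2-dump-ids.py (hypothetical path).
wanted = set(open("data/paper_ids.txt").read().split())

with open("data/orc_papers.jsonl", "w") as out:
    for path in glob.glob("corpus/s2-corpus-*.gz"):
        with gzip.open(path, "rt") as f:
            for line in f:
                record = json.loads(line)
                # ORC records carry an `id` field; adjust if your release names it differently.
                if record.get("id") in wanted:
                    out.write(line)
```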

### s2-raw-papers.py

Some papers are not in the ORC dataset and must be scraped from S2 directly.

---

## Extracting data from Google Scholar

The content-script folder contains a Chrome extension which scrapes Google Scholar through the browser, clicking the links and extracting PDF links, citation counts, etc., then saving a JSON file when it's done. It still requires work to process the output (cross-reference with S2 and dump the PDFs).

---

## Scraping Institutions

Once you have the data from S2, you can scrape all the PDFs (and other URLs) you find, and then extract institutions from those and geocode them.

### s2-dump-pdf-urls.py

Dump PDF URLs (and also IEEE URLs, etc.) to CSV files.
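
Roughly, and assuming ORC-style records with a `pdfUrls` field from the previous step (field name and paths are assumptions):

```
import csv
import json

with open("data/orc_papers.jsonl") as f, \
        open("reports/pdf_urls.csv", "w", newline="") as pdf_out, \
        open("reports/doi_urls.csv", "w", newline="") as doi_out:
    pdf_writer = csv.writer(pdf_out)
    doi_writer = csv.writer(doi_out)
    for line in f:
        record = json.loads(line)
        for url in record.get("pdfUrls", []):  # field name from the ORC release; may differ
            row = [record["id"], url]
            # Route publisher/DOI links to their own file for separate handling.
            if "ieee" in url or "doi.org" in url:
                doi_writer.writerow(row)
            else:
                pdf_writer.writerow(row)
```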

### s2-fetch-pdf.py

Fetch the PDFs.
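
A minimal fetch loop over the dumped CSV (hypothetical paths, no retries):

```
import csv
import os
import requests

os.makedirs("pdfs", exist_ok=True)
with open("reports/pdf_urls.csv", newline="") as f:
    for paper_id, url in csv.reader(f):
        dest = os.path.join("pdfs", paper_id + ".pdf")
        if os.path.exists(dest):
            continue  # skip already-downloaded files so the script can be re-run
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
        except requests.RequestException as e:
            print("failed:", url, e)
            continue
        with open(dest, "wb") as out:
            out.write(resp.content)
```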

### s2-fetch-doi.py

Fetch the files listed in ieee.json and process them.

### pdf_dump_first_page.sh

Use pdfminer.six to extract the first page from the PDFs.
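
The script itself is a shell wrapper; the equivalent in Python, using pdfminer.six's `extract_text` with `maxpages=1` (affiliations usually sit on the first page; paths are assumptions):

```
import glob
import os

from pdfminer.high_level import extract_text

os.makedirs("first_pages", exist_ok=True)
for path in glob.glob("pdfs/*.pdf"):
    try:
        text = extract_text(path, maxpages=1)  # only the first page
    except Exception as e:
        print("failed:", path, e)
        continue
    out_path = os.path.join("first_pages", os.path.basename(path) + ".txt")
    with open(out_path, "w") as out:
        out.write(text)
```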

### s2-pdf-first-pages.py

Perform initial extraction of university-like terms, to be geocoded.
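
The script's exact heuristics aren't spelled out here; a crude first pass could just regex for institution-shaped phrases in the first-page dumps:

```
import csv
import glob
import re

# Loose pattern for institution-like phrases; the real script's heuristics may differ.
UNI_RE = re.compile(
    r"(University of [A-Z][A-Za-z.\- ]+"
    r"|[A-Z][A-Za-z.\- ]+ (?:University|Institute of Technology))"
)

entities = set()
for path in glob.glob("first_pages/*.txt"):
    with open(path, errors="ignore") as f:
        text = f.read()
    for match in UNI_RE.finditer(text):
        entities.add(match.group(0).strip())

with open("reports/entities.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for name in sorted(entities):
        writer.writerow([name])
```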

### s2-doi-report.py

Extract named entities from the scraped DOI links (IEEE, ACM, etc.).

### s2-geocode.py

Geocode lists of entities using Nominatim.
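
Nominatim has a simple public search endpoint; a sketch (file paths are assumptions, and note the roughly one-request-per-second usage policy):

```
import csv
import time

import requests

NOMINATIM = "https://nominatim.openstreetmap.org/search"

def geocode(name):
    """Return (lat, lon) for the best Nominatim hit, or (None, None)."""
    resp = requests.get(
        NOMINATIM,
        params={"q": name, "format": "json", "limit": 1},
        headers={"User-Agent": "megapixels-scraper"},  # Nominatim asks for an identifying User-Agent
        timeout=30,
    )
    resp.raise_for_status()
    results = resp.json()
    return (results[0]["lat"], results[0]["lon"]) if results else (None, None)

with open("reports/entities.csv", newline="") as f, \
        open("reports/geocoded.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in csv.reader(f):
        name = row[0]
        lat, lng = geocode(name)
        writer.writerow([name, lat, lng])
        time.sleep(1)  # stay within Nominatim's rate limit
```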

### s2-citation-report.py

For each paper in the citations CSV, find the corresponding paper in the database and get all of its citations.
Then try to find an address for each citation, embed the matching entries from the institutions list, and render them on a Leaflet map.

---

## Cleaning the Data

After scraping these universities, we got up to a 47% match rate on papers from the dataset. However, there is still more to solve:

- Fix the geocoding: this must be done manually. We will dedupe the entries in the entities table, then extract specific entities from the dataset.
- Unknown addresses: we have addresses for some places, but we need to (a) geocode them again or (b) fall back to geocoding just the city.
- Match institution names that span multiple lines.
- Empty addresses: some papers may need to be gone through by hand. Maybe we can do bigram/trigram analysis on the headings; just finding common words would help.
- Make a list of bogus papers: ones where pdfminer returned empty results, or which did not contain the word ABSTRACT, or were too long.

### expand-uni-lookup.py

By now I had a list of institutions in `reports/all_institutions.csv` (built by merging the results of the geocoding, which I had run on 4 computers and thus had 4 files of institutions). This file must be gone through manually. This technique geocoded around 47% of papers.

At this point I moved `reports/all_institutions.csv` into the Google Sheets.  All further results use the CSV on Google Sheets.

### s2-pdf-report.py

Generates reports of items from the PDFs that were not found.

### s2-geocode-spreadsheet.py

To add new institutions, simply list them in the spreadsheet with the lat/lng fields empty.  Then run this script and anything missing a lat/lng will get one.

### s2-citation-report.py

Generate the main report with maps and citation lists.

---

## Useful scripts for batch processing

### split-csv.py

Shuffle and split a CSV into multiple files.
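
Something along these lines, assuming a header row and a part count given on the command line (usage and output naming are assumptions):

```
import csv
import random
import sys

# Usage (hypothetical): python split-csv.py input.csv 4
path, n_parts = sys.argv[1], int(sys.argv[2])

with open(path, newline="") as f:
    rows = list(csv.reader(f))
header, rows = rows[0], rows[1:]
random.shuffle(rows)

for i in range(n_parts):
    # Round-robin the shuffled rows into n_parts files, each keeping the header.
    with open("{}.part{}.csv".format(path, i), "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows[i::n_parts])
```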

### merge-csv.py

Merge a folder of CSVs into a single file, deduping based on the first column.
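
A sketch of the merge-and-dedupe, keeping the first occurrence of each key (usage is an assumption):

```
import csv
import glob
import os
import sys

# Usage (hypothetical): python merge-csv.py input_folder/ merged.csv
folder, out_path = sys.argv[1], sys.argv[2]

seen = set()
with open(out_path, "w", newline="") as out:
    writer = csv.writer(out)
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        with open(path, newline="") as f:
            for row in csv.reader(f):
                if not row or row[0] in seen:
                    continue  # dedupe on the first column (duplicate headers drop out too)
                seen.add(row[0])
                writer.writerow(row)
```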