# Thesaurus

This library spiders the University of Glasgow's [Historical Thesaurus of English](https://ht.ac.uk/) to create linkages between words.  Give it two words, and it will find a connection of a specified depth using synonyms.

The Historical Thesaurus collects words under categories, which we can use to get from word to word.  Many words have multiple meanings, which allows us to traverse vast stretches of linguistic terrain efficiently.

## Installation

First, install [Miniconda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html).  Next, clone the repo and set up the Python environment.

```
git clone https://github.com/julescarbon/thesaurus.git
cd thesaurus
conda create -n thesaurus python=3.7
conda activate thesaurus
pip install -r requirements.txt
```

You should then have everything you need to run the scripts.  The scripts are all accessible through `cli.py` and are self-documenting:

```
cd cli
python cli.py --help
python cli.py bridge --help
python cli.py bridge words --help
```

## Bridging two words

The algorithm starts from each word and builds two trees.  When the trees overlap, it checks to see whether the linkage is of an appropriate depth.  If it is, it prints the chain, and gives you the option to keep searching, or to eliminate problematic words.

For example, bridge two words:

```
python cli.py bridge words --a light --b dark
```

### Command-line flags

| Flag                          | Description                            | Default      |
| ----------------------------- | -------------------------------------- | ------------ |
| --a TEXT                      | Starting word                          | (required)   |
| --b TEXT                      | Ending word                            | (required)   |
| --include_oe                  | Include OE/archaic words               | (off)        |
| --include_slang               | Include slang/colloquial words         | (off)        |
| --include_scots               | Include Scots words                    | (off)        |
| --shuffle/--no_shuffle        | Shuffle the queues                     | --shuffle    |
| --words_per_step INTEGER      | Number of words to check per step      | 20           |
| --categories_per_word INTEGER | Number of categories to check per word | 3            |
| --min_depth INTEGER           | Minimum depth of matches               | 10           |
| --use_shortest_path           | Use shortest path between words        | (off)        |
| --help                        | Show this message and exit.            |              |

### Resolution and depth

Using `--min_depth` you can specify how many hops you want.  This counts words plus categories, so for 10 hops we will get at least 4 words and 6 categories in between.

You can vary the search resolution by specifying how many words/categories are searched at a time.  With each pass, we shuffle the list of unseen words and categories, so you can force a broader or deeper search by tweaking these numbers.
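As a rough illustration of how these two numbers shape the search, the sketch below shuffles a list of unseen words and takes a limited batch from it, then samples a limited number of categories per word.  The helpers `next_batch` and `sample_categories` are hypothetical names for this example and are not claimed to match the repo's internals.

```python
import random


def next_batch(unseen, words_per_step=20, shuffle=True, rng=random):
    """Remove and return up to words_per_step items from the unseen list."""
    if shuffle:
        rng.shuffle(unseen)
    batch = unseen[:words_per_step]
    del unseen[:words_per_step]
    return batch


def sample_categories(categories, categories_per_word=3, rng=random):
    """Pick at most categories_per_word categories at random."""
    count = min(categories_per_word, len(categories))
    return rng.sample(list(categories), count)
```

In this sketch, a larger `words_per_step` widens each pass, while a smaller one lets the search move deeper along fewer branches before the queues are shuffled again.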

When computing the chain of synonyms, we descend the category tree from word to word, based on the order in which the words were seen.  This process can be short-circuited by jumping to the deepest word in the category, rather than the word adjacent in the traversal.  Doing so ignores the `--min_depth` parameter and can make the paths very short, so we prefer finding longer paths instead.

### Breaking the chain

Finally, after printing the chain, you have the option to cut the chain or to keep searching.  Each word in the chain has a number assigned to it: type a number to cut the chain at that word, or hit enter to find new chains.

Cutting the chain will clear the trees and restart the process.  If the same combination is found again, it will be ignored.  For example, the word "crash" has many categories involving dogs, which you might want to ignore.  You might also want to eliminate 16th century words that come up (for example, "swarf" for unconsciousness), but many of these are still in use, so we don't provide an option for skipping them.
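One way to picture the cut is as a set of banned links that later searches consult.  The sketch below is only an assumption about how such a list could work: the names `number_chain`, `cut` and `is_allowed` are hypothetical, and it treats a "combination" as the chosen entry plus the entry that follows it in the chain.

```python
def number_chain(chain):
    """Pair each entry of a chain with a 1-based number for display."""
    return list(enumerate(chain, start=1))


def cut(chain, choice, banned):
    """Ban the link between entry number `choice` and the entry after it."""
    if not 1 <= choice < len(chain):
        raise ValueError("choice must number an entry that has a successor")
    # Store the pair as a frozenset so the ban applies in both directions.
    banned.add(frozenset((chain[choice - 1], chain[choice])))
    return banned


def is_allowed(a, b, banned):
    """Return True unless the link between a and b has been cut."""
    return frozenset((a, b)) not in banned
```

A search loop would then skip any neighbour for which `is_allowed` returns `False`, which is how a combination found a second time would end up being ignored.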

All the words and categories are cached in the `data_store` directory, which speeds up the process of finding new synonyms.  The data is stored as JSON.  This folder can grow to hundreds of megabytes if you've done a lot of searches, but it can be safely removed when you're finished.
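A file-per-entry JSON cache of this kind might look something like the sketch below.  The function `cached_lookup`, its `fetch` callback and the one-file-per-key layout are assumptions made for illustration; the actual layout of `data_store` may differ.

```python
import json
from pathlib import Path


def cached_lookup(key, fetch, cache_dir="data_store"):
    """Return the cached JSON for key, calling fetch(key) only on a miss."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Assumes key is safe to use as a file name (no slashes, etc.).
    path = directory / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch(key)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

Because every entry lives in its own file, deleting the directory simply forces the next search to fetch everything again.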

## And finally

### Hey, show some respect

Please do not use this script to spider the entire Historical Thesaurus.  This violates their terms of service!  Take what you need, then delete the data when you're done.  This script just does the searches that you could have done on your own, but more efficiently.

### All rights reserved

This software is private and unlicensed.  We retain all copyright in the code itself.  We disclaim all responsibility or liability resulting from its use.  Unauthorized usage is blatantly illegal.  Download at your own peril.  Violators will be pursued into the unwaking world on the other side of sleep, hunted in the twilight of nightmare, and punished with the full weight of metaphor.

2020