A “Spider” is a computer program. The Spider goes to different websites, explores, finds new things, and makes notes. When finished, its notes are organised into files describing everything it has seen.
The next step is “Word Embedding”. A complicated topic! Word embedding creates a map, with coordinates, and puts each word on the map in its own location. It uses Artificial Intelligence research from written language translation websites. English words that have a similar meaning are grouped into nearby areas on the map. This is important! Similar ideas are in similar areas of the map.
When you search for a word, your computer does this:
- Download a map of all the signs the spider found, and what kind of meaning they have
- Look at the Word Embedding map, and find the word you searched for, remembering the location
- Check every sign, and see how close its meaning is to the word you entered, by how far apart they are on the map
- Make a list, organised so the closest signs are at the top
- Download small videos and descriptions, and create the webpage, with results listed down the page
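The ranking steps above can be sketched in a few lines of Python. This is only an illustration, not Find Sign's actual code: the coordinates, the sign list, and the names WORD_MAP, SIGNS, and rank_signs are all made up, a real word embedding map has many more than two dimensions, and straight-line distance is assumed here as the way of measuring "how far apart".

```python
import math

# Toy two-dimensional "map". These coordinates are invented for illustration;
# a real word embedding uses many more dimensions.
WORD_MAP = {
    "cat": (1.0, 1.0),
    "kitten": (1.2, 0.9),
    "dog": (2.0, 1.5),
    "car": (8.0, 7.0),
}

# Hypothetical signs found by the spider, each tagged with one meaning word.
SIGNS = [
    {"title": "sign A", "meaning": "car"},
    {"title": "sign B", "meaning": "kitten"},
    {"title": "sign C", "meaning": "dog"},
]


def rank_signs(query, signs=SIGNS, word_map=WORD_MAP):
    """Return the signs sorted so the closest meanings come first."""
    # Find the searched word on the map and remember its location.
    query_location = word_map[query]
    # Measure how far each sign's meaning is from that location
    # (straight-line distance is an assumption), then sort by it.
    return sorted(
        signs,
        key=lambda sign: math.dist(query_location, word_map[sign["meaning"]]),
    )


for sign in rank_signs("cat"):
    print(sign["title"], sign["meaning"])
```

Searching for "cat" in this toy map puts the "kitten" sign first and the "car" sign last, because "kitten" sits nearest to "cat" on the map.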
Privacy and Tracking
We use a program called Fathom to see how many people use this website, and to find out what website they came from if they clicked a link to come here. When you do a search, we don’t find out what words you typed into the search box. We might be able to guess what kind of thing you were looking at, by seeing which search results your computer downloaded, but we don’t know exactly what you wrote. We don’t send your information to Google or Facebook at all, but if you come here using one of their apps, like Google Chrome or the Facebook app, they might be able to track you while you use their app.
Open Source and Dataset
Find Sign is open source (mostly under the Unlicense software license). You can get the source code on GitHub.
Find Sign indexes copyrighted data, so the dataset is not open source, but it is open access. You can download an up-to-date copy of Find Sign’s dataset using a BitTorrent app that supports webseed (most do!) by using the automatically generated datasets.torrent. If you download the GitHub repository, add the datasets folder from the torrent, and run it from a regular static HTTP server, you should have a fully functioning local copy of Find Sign, including a full copy of the search index.
You should be able to easily reconfigure the spider to index other SignBank-based sign language dictionary websites, or to index Instagram and YouTube sources. Take a look at the spider code for examples of how to build custom spiders to index other websites.
If you just want to ingest the search index, a reasonable approach is to download the datasets torrent, and read through all the JSON files in the def folder. These are regular JSON files, sharded by some hashing, but easy to load into another search index. Paths to video files are relative to the path to the folder containing the index.bin file.
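As a rough starting point, reading through the JSON files might look like the Python sketch below. The function name load_definition_files is made up for this example, and it assumes nothing about what is inside each file: it just walks the def folder, including any subfolders the sharding creates, and parses every file ending in .json.

```python
import json
from pathlib import Path


def load_definition_files(def_folder):
    """Yield (path, parsed_json) for every .json file under def_folder."""
    # rglob searches subfolders too, in case the shards are nested.
    for path in sorted(Path(def_folder).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield path, json.load(handle)
```

You would point it at wherever your torrent app saved the def folder, loop over the results, and pass each parsed file to your own indexing code. Look at a few files by hand first to see how they are laid out.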