{"_id":"5a8fae0468264c001f20cc45","category":{"_id":"5a8fae0368264c001f20cc05","version":"5a8fae0268264c001f20cc00","project":"5633ebff7e9e880d00af1a53","__v":0,"sync":{"url":"","isSync":false},"reference":false,"createdAt":"2016-01-04T21:11:27.469Z","from_sync":false,"order":6,"slug":"dashboards","title":"Common Tools"},"parentDoc":null,"project":"5633ebff7e9e880d00af1a53","user":"5633ec9b35355017003ca3f2","version":{"_id":"5a8fae0268264c001f20cc00","project":"5633ebff7e9e880d00af1a53","__v":2,"createdAt":"2018-02-23T06:00:34.961Z","releaseDate":"2018-02-23T06:00:34.961Z","categories":["5a8fae0268264c001f20cc01","5a8fae0268264c001f20cc02","5a8fae0368264c001f20cc03","5a8fae0368264c001f20cc04","5a8fae0368264c001f20cc05","5a8fae0368264c001f20cc06","5a8fae0368264c001f20cc07","5a8fae0368264c001f20cc08","5a8fae0368264c001f20cc09","5abaa7eb72d6dc0028a07bf3"],"is_deprecated":false,"is_hidden":false,"is_beta":false,"is_stable":true,"codename":"","version_clean":"2.0.0","version":"2.0"},"__v":0,"updates":["5a1f80928d762a002a98f4de","5a1fa01e681312001e23af9b"],"next":{"pages":[],"description":""},"createdAt":"2016-08-15T19:38:07.235Z","link_external":false,"link_url":"","githubsync":"","sync_unique":"","hidden":false,"api":{"results":{"codes":[]},"settings":"","auth":"required","params":[],"url":""},"isReference":false,"order":3,"body":"WordNet is a huge [lexical database](https://en.wikipedia.org/wiki/Lexical_database) that collects and orders English words into groups of synonyms. It can offer major improvements in relevancy, but it is not at all necessary for many use cases. Make sure you understand the [tradeoffs ](doc:using-wordnet-with-bonsai#why-wouldnt-everyone-want-wordnet) well before setting it up.\n\nThere are two ways to use WordNet with Bonsai. 
Users can add a subset of the list using the [Elasticsearch API](doc:using-wordnet-with-bonsai#using-wordnet-via-the-elasticsearch-api), or use the [WordNet file](doc:using-wordnet-with-bonsai#using-the-wordnet-list-file-wn_spl) that comes standard with all Bonsai clusters.\n\nRead on to learn more about WordNet.\n[block:api-header]\n{\n  \"title\": \"What Are Synonyms?\"\n}\n[/block]\nLet’s say that you have an online store with a lot of products. You want users to be able to search for those products, but you want that search to be smart. For example, say that your user searches for \"bowtie pasta.\" You may have a product called “Funky Farfalle” which is related to their search term but which would not be returned in the results because the title has \"farfalle\" instead of \"bowtie pasta\". How do you address this issue?\n\nElasticsearch has a mechanism for defining custom synonyms, through the [Synonym Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html). This lets search administrators define groups of related terms and even corrections to commonly misspelled terms. 
A solution to this use case might look like this:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"{\\n    \\\"settings\\\": {\\n        \\\"index\\\" : {\\n            \\\"analysis\\\" : {\\n                \\\"filter\\\" : {\\n                    \\\"synonym\\\" : {\\n                        \\\"type\\\" : \\\"synonym\\\",\\n                        \\\"synonyms\\\" : [\\n                            \\\"bowtie pasta, farfalle\\\"\\n                        ]\\n                    }\\n                }\\n            }\\n        }\\n    }\\n}\",\n      \"language\": \"json\"\n    }\n  ]\n}\n[/block]\nThis is great for solving the proximate issue, but what it can get extremely tedious to define _all_ groups of related words in your index.\n[block:api-header]\n{\n  \"title\": \"How Does WordNet Improve Synonyms?\"\n}\n[/block]\nWordNet is essentially a text database which places English words into synsets - groups of synonyms - and can be considered as something of a cross between a dictionary and a thesaurus. An entry in WordNet looks something like this:\n\n```\ns(102122298,1,\"kitty\",n,4,0).\n```\n\nLet's break it down:\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/77fe711-synset-breakdown.png\",\n        \"synset-breakdown.png\",\n        1200,\n        442,\n        \"#ddeef8\"\n      ]\n    }\n  ]\n}\n[/block]\nThis line expresses that the word 'kitty' is a noun, and the first word in synset 102122298 (which includes other terms like \"kitty-cat,\" \"pussycat,\" and so on). The line also indicates 'kitty' is the fourth most commonly used term according to semantic concordance texts. You can read more about the structure and precise definitions of WordNet entries in the [documentation](https://wordnet.princeton.edu/wordnet/man/prologdb.5WN.html).\n\nThe WordNet has become extremely useful in text processing applications, including data storage and retrieval. 
Some use cases require features like synonym processing, for which a lexical grouping of tokens is invaluable.\n[block:api-header]\n{\n  \"title\": \"Why Wouldn't Everyone Want WordNet?\"\n}\n[/block]\nRelevancy tuning can be a deeply complex subject, and WordNet -- especially when the [complete file](doc:using-wordnet-with-bonsai#using-the-wordnet-list-file-wn_spl) is used -- has tradeoffs, just like any other strategy. Synonym expansion can be really tricky and can result in unexpected sorting, lower performance and more disk use. WordNet can introduce all of these issues with varying severity.\n\nWhen synonyms are expanded at index time, Elasticsearch uses WordNet to generate all tokens related to a given token, and writes everything out to disk. This has several consequences: slower indexing speed, higher load during indexing, and significantly more disk use. Larger index sizes often correspond to memory issues too, if the entire index can't reside in memory.\n\nThere is also the problem of updating. If you ever want to change your synonym list, you'll need to reindex everything from scratch. And WordNet includes multi-term synonyms in its database, which can break phrase queries.\n\nExpanding synonyms at query time resolves _some_ of those issues, but introduces others. Namely, performing expansion and matching at query time adds overhead to your queries in terms of server load and latency. And it still doesn't really address the problem of multi word synonyms.\n\nThe Elasticsearch documentation [some really great examples](https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-word-synonyms.html) of what this means. 
The takeaway is that WordNet is _not_ a panacea for relevancy tuning, and it may introduce unexpected results unless you're doing a lot of preprocessing or additional configuration.\n\n**tl;dr:** Do not simply assume that chucking a massive synset collection at your cluster will make it faster with more relevant results.\n[block:api-header]\n{\n  \"title\": \"Using WordNet via the Elasticsearch API\"\n}\n[/block]\nElasticsearch supports several different list formats, including the WordNet format. WordNet synonyms are maintained in a Prolog file called `wn_s.pl`. To use these in your cluster, you'll need to [download](http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz) the WordNet archive and extract the `wn_s.pl` file. You'll then need to create your synonyms list by reading this file into a request to your cluster.\n\nThe target index could be created with settings like so:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"\\n    PUT https://randomuser:randompass:::at:::something-12345.us-east-1.bonsai.io/some_index\\n    \\n    {\\n      \\\"settings\\\": {\\n        \\\"analysis\\\": {\\n          \\\"filter\\\": {\\n            \\\"wn_synonym_filter\\\": {\\n              \\\"type\\\": \\\"synonym\\\",\\n              \\\"format\\\" : \\\"wordnet\\\",\\n              \\\"synonyms\\\" : [\\n                  \\\"s(100000001,1,\\\"abstain\\\",v,1,0).\\\",\\n                  \\\"s(100000001,2,\\\"refrain\\\",v,1,0).\\\",\\n                  \\\"s(100000001,3,\\\"desist\\\",v,1,0).\\\",\\n                  #... 
more synonyms, read from wn_s.pl file\\n              ]\\n            }\\n          },\\n          \\\"analyzer\\\": {\\n            \\\"my_synonyms\\\": {\\n              \\\"tokenizer\\\": \\\"standard\\\",\\n              \\\"filter\\\": [\\n                \\\"lowercase\\\",\\n                \\\"wn_synonym_filter\\\"\\n              ]\\n            }\\n          }\\n        }\\n      }\\n    }\\n\",\n      \"language\": \"json\"\n    }\n  ]\n}\n[/block]\nThere are a number of ways to generate this request. You could do it programmatically with a language like Python, or Bash scripts with `curl`, or any language with which you feel comfortable.\n\nA benefit of using a subset of the list would be more control over your mappings and data footprint. Depending on when your analyzer is running, you could save IO by not computing unnecessary expansions for terms not in your corpus or search parameters. Reducing the overhead will improve performance overall.\n[block:api-header]\n{\n  \"title\": \"Using the WordNet List File, wn_s.pl\"\n}\n[/block]\nIf you would rather use the official WordNet list, it is part of our Elasticsearch deployment. You can follow the official [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html) for WordNet synonyms, and link to the file with `analysis/wn_s.pl`. 
For example:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"PUT https://username:password@my-awesome-cluster.us-east-1.bonsai.io/some_index\\n{\\n    \\\"settings\\\": {\\n        \\\"index\\\" : {\\n            \\\"analysis\\\" : {\\n                \\\"analyzer\\\" : {\\n                    \\\"synonym\\\" : {\\n                        \\\"tokenizer\\\" : \\\"whitespace\\\",\\n                        \\\"filter\\\" : [\\\"synonym\\\"]\\n                    }\\n                },\\n                \\\"filter\\\" : {\\n                    \\\"synonym\\\" : {\\n                        \\\"type\\\" : \\\"synonym\\\",\\n                        \\\"synonyms_path\\\" : \\\"analysis/wn_s.pl\\\"\\n                   }\\n                }\\n            }\\n        }\\n    }\\n}\",\n      \"language\": \"json\"\n    }\n  ]\n}\n[/block]\n\n[block:api-header]\n{\n  \"title\": \"Resources\"\n}\n[/block]\nWordNet is a large subject and a great topic to delve deeper into. Here are some links for further reading:\n\n- [WordNet, a Princeton Project](https://wordnet.princeton.edu/wordnet/)\n- [Download WordNet](https://wordnet.princeton.edu/wordnet/download/)\n- [WordNet Documentation](https://wordnet.princeton.edu/wordnet/documentation/)\n- [WordNet in other languages](http://globalwordnet.org/)","excerpt":"Learn how to match synonyms in your searches using Wordnet and Elasticsearch.","slug":"using-wordnet-with-bonsai","type":"basic","title":"Using WordNet with Bonsai"}

# Using WordNet with Bonsai

Learn how to match synonyms in your searches using WordNet and Elasticsearch.

WordNet is a huge [lexical database](https://en.wikipedia.org/wiki/Lexical_database) that collects and orders English words into groups of synonyms. It can offer major improvements in relevancy, but it is not at all necessary for many use cases. Make sure you understand the [tradeoffs](doc:using-wordnet-with-bonsai#why-wouldnt-everyone-want-wordnet) well before setting it up.

There are two ways to use WordNet with Bonsai. Users can add a subset of the list using the [Elasticsearch API](doc:using-wordnet-with-bonsai#using-wordnet-via-the-elasticsearch-api), or use the [WordNet file](doc:using-wordnet-with-bonsai#using-the-wordnet-list-file-wn_spl) that comes standard with all Bonsai clusters.

Read on to learn more about WordNet.

## What Are Synonyms?

Let's say that you have an online store with a lot of products. You want users to be able to search for those products, and you want that search to be smart. For example, say that a user searches for "bowtie pasta." You may have a product called "Funky Farfalle" which is related to their search term, but which would not be returned in the results because the title has "farfalle" instead of "bowtie pasta." How do you address this issue?

Elasticsearch has a mechanism for defining custom synonyms: the [Synonym Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html). It lets search administrators define groups of related terms, and even corrections for commonly misspelled terms.
A solution to this use case might look like this:

```json
{
    "settings": {
        "index": {
            "analysis": {
                "filter": {
                    "synonym": {
                        "type": "synonym",
                        "synonyms": [
                            "bowtie pasta, farfalle"
                        ]
                    }
                }
            }
        }
    }
}
```

This is great for solving the proximate issue, but it can get extremely tedious to define _all_ groups of related words in your index.

## How Does WordNet Improve Synonyms?

WordNet is essentially a text database which places English words into synsets (groups of synonyms), and can be thought of as a cross between a dictionary and a thesaurus. An entry in WordNet looks something like this:

```
s(102122298,1,"kitty",n,4,0).
```

Let's break it down:

![synset-breakdown](https://files.readme.io/77fe711-synset-breakdown.png)

This line expresses that the word "kitty" is a noun and the first word in synset 102122298 (which includes other terms like "kitty-cat," "pussycat," and so on). The line also indicates that "kitty" is the fourth most commonly used of these terms, according to semantic concordance texts. You can read more about the structure and precise definitions of WordNet entries in the [documentation](https://wordnet.princeton.edu/wordnet/man/prologdb.5WN.html).

WordNet has become extremely useful in text processing applications, including data storage and retrieval. Some use cases require features like synonym processing, for which a lexical grouping of tokens is invaluable.

## Why Wouldn't Everyone Want WordNet?

Relevancy tuning can be a deeply complex subject, and WordNet (especially when the [complete file](doc:using-wordnet-with-bonsai#using-the-wordnet-list-file-wn_spl) is used) has tradeoffs, just like any other strategy.
Synonym expansion can be tricky, and can result in unexpected sorting, lower performance, and more disk use. WordNet can introduce all of these issues with varying severity.

When synonyms are expanded at index time, Elasticsearch uses WordNet to generate all tokens related to a given token, and writes everything out to disk. This has several consequences: slower indexing speed, higher load during indexing, and significantly more disk use. Larger index sizes often correspond to memory issues too, if the entire index can't reside in memory.

There is also the problem of updating. If you ever want to change your synonym list, you'll need to reindex everything from scratch. And WordNet includes multi-word synonyms in its database, which can break phrase queries.

Expanding synonyms at query time resolves _some_ of those issues, but introduces others. Namely, performing expansion and matching at query time adds overhead to your queries in terms of server load and latency. And it still doesn't really address the problem of multi-word synonyms.

The Elasticsearch documentation has [some really great examples](https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-word-synonyms.html) of what this means. The takeaway is that WordNet is _not_ a panacea for relevancy tuning, and it may introduce unexpected results unless you're doing a lot of preprocessing or additional configuration.

**tl;dr:** Do not simply assume that chucking a massive synset collection at your cluster will make it faster or return more relevant results.

## Using WordNet via the Elasticsearch API

Elasticsearch supports several different synonym list formats, including the WordNet format. WordNet synonyms are maintained in a Prolog file called `wn_s.pl`. To use these in your cluster, you'll need to [download](http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz) the WordNet archive and extract the `wn_s.pl` file.
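Once extracted, the file can be read and filtered programmatically. Here is a minimal Python sketch (the file path, the regular expression, and the vocabulary set are illustrative assumptions, not an official tool) that collects `s(...)` facts from `wn_s.pl`, optionally keeping only synsets that mention a term from your own corpus:

```python
import re

# Matches one WordNet Prolog fact, e.g.  s(102122298,1,'kitty',n,4,0).
# (The Prolog file quotes words with single quotes.)
SYNSET_RE = re.compile(r"^s\((\d+),\d+,'([^']+)',")

def load_synonyms(path, vocabulary=None):
    """Read wn_s.pl and return the s(...) lines to embed in a
    "synonyms" array. If a vocabulary set is given, keep only synsets
    that mention at least one word from it."""
    by_synset = {}
    with open(path, encoding="ISO-8859-1") as f:
        for line in f:
            m = SYNSET_RE.match(line)
            if m:
                by_synset.setdefault(m.group(1), []).append(line.strip())

    if vocabulary is None:
        return [entry for lines in by_synset.values() for entry in lines]

    kept = []
    for lines in by_synset.values():
        words = {SYNSET_RE.match(l).group(2) for l in lines}
        if words & vocabulary:  # synset overlaps the corpus vocabulary
            kept.extend(lines)
    return kept
```

The returned list can then be dropped into the `synonyms` array of the settings request shown below, which is one way to implement the subsetting approach.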
You'll then need to create your synonyms list by reading this file into a request to your cluster. The target index could be created with settings like so:

```json
PUT https://randomuser:randompass@something-12345.us-east-1.bonsai.io/some_index

{
  "settings": {
    "analysis": {
      "filter": {
        "wn_synonym_filter": {
          "type": "synonym",
          "format": "wordnet",
          "synonyms": [
            "s(100000001,1,'abstain',v,1,0).",
            "s(100000001,2,'refrain',v,1,0).",
            "s(100000001,3,'desist',v,1,0).",
            #... more synonyms, read from wn_s.pl file
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "wn_synonym_filter"
          ]
        }
      }
    }
  }
}
```

There are a number of ways to generate this request: programmatically with a language like Python, with Bash scripts and `curl`, or with any language you feel comfortable in.

A benefit of using a subset of the list is more control over your mappings and data footprint. Depending on when your analyzer runs, you can save IO by not computing unnecessary expansions for terms that appear in neither your corpus nor your search parameters. Reducing that overhead improves performance overall.

## Using the WordNet List File, wn_s.pl

If you would rather use the official WordNet list, it is included as part of our Elasticsearch deployment. You can follow the official [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html) for WordNet synonyms, and reference the file with the path `analysis/wn_s.pl`.
For example:

```json
PUT https://username:password@my-awesome-cluster.us-east-1.bonsai.io/some_index

{
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "synonym": {
                        "tokenizer": "whitespace",
                        "filter": ["synonym"]
                    }
                },
                "filter": {
                    "synonym": {
                        "type": "synonym",
                        "synonyms_path": "analysis/wn_s.pl"
                    }
                }
            }
        }
    }
}
```

## Resources

WordNet is a large subject and a great topic to delve deeper into. Here are some links for further reading:

- [WordNet, a Princeton Project](https://wordnet.princeton.edu/wordnet/)
- [Download WordNet](https://wordnet.princeton.edu/wordnet/download/)
- [WordNet Documentation](https://wordnet.princeton.edu/wordnet/documentation/)
- [WordNet in other languages](http://globalwordnet.org/)
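As a practical footnote to the settings examples above: once an index exists, you can sanity-check that WordNet expansion is actually happening by running a term through the analyzer with Elasticsearch's `_analyze` API (available on recent Elasticsearch versions). A hedged Python sketch; the cluster URL, index, and analyzer names are placeholders matching the examples, and actually sending the request requires a live cluster:

```python
import json
from urllib import request

# Placeholder credentials/host -- substitute your own Bonsai cluster URL.
CLUSTER_URL = "https://username:password@my-awesome-cluster.us-east-1.bonsai.io"

def build_analyze_request(index, analyzer, text):
    """Build a POST to the _analyze endpoint that runs `text` through
    `analyzer`; the response's "tokens" should include WordNet expansions."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return request.Request(
        f"{CLUSTER_URL}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_analyze_request("some_index", "synonym", "kitty")
# Against a live cluster, you would then run:
#   with request.urlopen(req) as resp:
#       print(json.load(resp)["tokens"])
```

If the synonym filter is wired up correctly, the token list for "kitty" should include its synset siblings rather than just the original term.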