{"_id":"57b21a1fa20e6119008d68e7","parentDoc":null,"category":{"_id":"568adfffcbd4ca0d00aebf7e","project":"5633ebff7e9e880d00af1a53","version":"5633ec007e9e880d00af1a56","__v":1,"pages":["568ae00c7a42220d00498265"],"sync":{"url":"","isSync":false},"reference":false,"createdAt":"2016-01-04T21:11:27.469Z","from_sync":false,"order":5,"slug":"dashboards","title":"Common Tools"},"project":"5633ebff7e9e880d00af1a53","version":{"_id":"5633ec007e9e880d00af1a56","project":"5633ebff7e9e880d00af1a53","__v":16,"createdAt":"2015-10-30T22:15:28.105Z","releaseDate":"2015-10-30T22:15:28.105Z","categories":["5633ec007e9e880d00af1a57","5633f072737ea01700ea329d","5637a37d0704070d00f06cf4","5637cf4e7ca5de0d00286aeb","564503082c74cf1900da48b4","564503cb7f1fff210078e70a","567af26cb56bac0d0019d87d","567afeb8802b2b17005ddea0","567aff47802b2b17005ddea1","567b0005802b2b17005ddea3","568adfffcbd4ca0d00aebf7e","56ba80078cf7c9210009673e","574d127f6f075519007da3d0","574fde60aef76a0e00840927","57a22ba6cd51b22d00f623a0","5a062c15a66ae1001a3f5b09"],"is_deprecated":false,"is_hidden":false,"is_beta":false,"is_stable":true,"codename":"","version_clean":"1.0.0","version":"1.0"},"__v":3,"user":"5633ec9b35355017003ca3f2","updates":["5a1f80928d762a002a98f4de","5a1fa01e681312001e23af9b"],"next":{"pages":[],"description":""},"createdAt":"2016-08-15T19:38:07.235Z","link_external":false,"link_url":"","githubsync":"","sync_unique":"","hidden":false,"api":{"results":{"codes":[]},"settings":"","auth":"required","params":[],"url":""},"isReference":false,"order":4,"body":"Let’s say that you have an online store with a lot of products. You want users to be able to search for those products, but you want that search to be smart. For example, say that your user searches for \"bowtie pasta.\" You may have a product called “Funky Farfalle” which is related to their search term but which would not be returned in the results because the title has \"farfalle\" instead of \"bowtie pasta\". How do you address this issue?\n\nSolr has a mechanism for defining custom synonyms, through the [SynonymFilterFactory](https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory)). This  lets search administrators define groups of related terms and even corrections to commonly misspelled terms. A typical synonyms.txt file might look like this:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"i-pod, i pod => ipod\\nfeline,kitten,cat,kitty\",\n      \"language\": \"text\"\n    }\n  ]\n}\n[/block]\nThis is great for solving the proximate issue, but what it can get extremely tedious to define all groups of related words in your index.\n\n# What is WordNet?\n\nWordNet has been described as a [lexical database](https://en.wikipedia.org/wiki/Lexical_database) by its creators. Essentially it is a text database which places English words into synsets - groups of synonyms - and can be considered as something of a cross between a dictionary and a thesaurus. An entry in WordNet looks something like this:\n\n`s(102122298,1,\"kitty\",n,4,0).`\n\nlet's break it down:\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/77fe711-synset-breakdown.png\",\n        \"synset-breakdown.png\",\n        1200,\n        442,\n        \"#ddeef8\"\n      ]\n    }\n  ]\n}\n[/block]\nThis line expresses that the word 'kitty' is a noun, and the first word in synset 102122298 (which includes other terms like \"kitty-cat,\" \"pussycat,\" and so on). 
The line also indicates 'kitty' is the fourth most commonly used term according to semantic concordance texts. You can read more about the structure and precise definitions of WordNet entries in the [documentation](https://wordnet.princeton.edu/wordnet/man/prologdb.5WN.html).\n\nThe WordNet has become extremely useful in text processing applications, including data storage and retrieval. Some use cases require features like synonym processing, for which a lexical grouping of tokens is invaluable.\n\n## How Do I Use the WordNet list with Bonsai?\n\nThere are two ways to use WordNet with Bonsai. Users can add a subset of the list using the  [Elasticsearch API](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html), or use the WordNet file that comes standard with all Bonsai clusters.\n\n### Using WordNet via the API\n\nElasticsearch supports several different list formats, including the WordNet format. WordNet synonyms are maintained in a Prolog file called `wn_s.pl`. To use these in your cluster, you'll need to [download](http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz) the WordNet archive and extract the `wn_s.pl` file. You'll then need to create your synonyms list by reading this file into a request to your cluster.\n\nThe target index could be created with settings like so:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"\\n    PUT https://username:password:::at:::my-awesome-cluster.us-east-1.bonsai.io/some_index\\n    \\n    {\\n      \\\"settings\\\": {\\n        \\\"analysis\\\": {\\n          \\\"filter\\\": {\\n            \\\"wn_synonym_filter\\\": {\\n              \\\"type\\\": \\\"synonym\\\",\\n              \\\"format\\\" : \\\"wordnet\\\",\\n              \\\"synonyms\\\" : [\\n                  \\\"s(100000001,1,\\\"abstain\\\",v,1,0).\\\",\\n                  \\\"s(100000001,2,\\\"refrain\\\",v,1,0).\\\",\\n                  \\\"s(100000001,3,\\\"desist\\\",v,1,0).\\\",\\n                  #... more synonyms, read from wn_s.pl file\\n              ]\\n            }\\n          },\\n          \\\"analyzer\\\": {\\n            \\\"my_synonyms\\\": {\\n              \\\"tokenizer\\\": \\\"standard\\\",\\n              \\\"filter\\\": [\\n                \\\"lowercase\\\",\\n                \\\"wn_synonym_filter\\\"\\n              ]\\n            }\\n          }\\n        }\\n      }\\n    }\\n\",\n      \"language\": \"json\"\n    }\n  ]\n}\n[/block]\nThere are a number of ways to generate this request. You could do it programmatically with a language like Python, or Bash scripts with curl, or any language with which you feel comfortable.\n\nA benefit of using a subset of the list would be more control over your mappings and data footprint. Depending on when your analyzer is running, you could save IO by not computing unnecessary expansions for terms not in your corpus or search parameters. Reducing the overhead will improve performance overall.\n\n### Using the WordNet list\n\nIf you would rather use the official WordNet list, it is part of our normal deployment process. You can follow the official [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html) for WordNet synonyms, and link to the file with `analysis/wn_s.pl`. 
For example:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"PUT https://username:password@my-awesome-cluster.us-east-1.bonsai.io/some_index\\n{\\n    \\\"settings\\\": {\\n        \\\"index\\\" : {\\n            \\\"analysis\\\" : {\\n                \\\"analyzer\\\" : {\\n                    \\\"synonym\\\" : {\\n                        \\\"tokenizer\\\" : \\\"whitespace\\\",\\n                        \\\"filter\\\" : [\\\"synonym\\\"]\\n                    }\\n                },\\n                \\\"filter\\\" : {\\n                    \\\"synonym\\\" : {\\n                        \\\"type\\\" : \\\"synonym\\\",\\n                        \\\"synonyms_path\\\" : \\\"analysis/wn_s.pl\\\"\\n                   }\\n                }\\n            }\\n        }\\n    }\\n}\",\n      \"language\": \"json\"\n    }\n  ]\n}\n[/block]\n## Resources\n\nWordNet is a large subject and a great topic to delve deeper into. Here are some links for further reading:\n\n\n- WordNet, a Princeton Project https://wordnet.princeton.edu/wordnet/\n- Download WordNet: https://wordnet.princeton.edu/wordnet/download/\n- WordNet Documentation: https://wordnet.princeton.edu/wordnet/documentation/\n- WordNet in other languages: http://globalwordnet.org/","excerpt":"Learn how to match synonyms in your searches using Wordnet and Elasticsearch.","slug":"using-wordnet-with-bonsai","type":"basic","title":"Using WordNet with Bonsai"}

# Using WordNet with Bonsai

Learn how to match synonyms in your searches using WordNet and Elasticsearch.

Let’s say that you have an online store with a lot of products. You want users to be able to search for those products, but you want that search to be smart. For example, say a user searches for "bowtie pasta." You may have a product called “Funky Farfalle” which is related to their search term, but which would not be returned in the results because the title contains "farfalle" rather than "bowtie pasta". How do you address this issue?

Solr has a mechanism for defining custom synonyms through the [SynonymFilterFactory](https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory). This lets search administrators define groups of related terms, and even corrections for commonly misspelled terms. A typical synonyms.txt file might look like this:

```text
i-pod, i pod => ipod
feline,kitten,cat,kitty
```

This is great for solving the immediate issue, but it can get extremely tedious to define every group of related words in your index by hand.

# What is WordNet?

WordNet is described by its creators as a [lexical database](https://en.wikipedia.org/wiki/Lexical_database). Essentially, it is a text database which groups English words into synsets (sets of synonyms) and can be thought of as something of a cross between a dictionary and a thesaurus. An entry in WordNet looks something like this:

`s(102122298,1,"kitty",n,4,0).`

Let's break it down:

![synset-breakdown.png](https://files.readme.io/77fe711-synset-breakdown.png)

This line expresses that the word 'kitty' is a noun and the first word in synset 102122298 (which includes other terms like "kitty-cat," "pussycat," and so on). The line also indicates that 'kitty' is the fourth most commonly used term according to semantic concordance texts. You can read more about the structure and precise definitions of WordNet entries in the [documentation](https://wordnet.princeton.edu/wordnet/man/prologdb.5WN.html).

WordNet has become extremely useful in text processing applications, including data storage and retrieval. Some use cases require features like synonym processing, for which a lexical grouping of tokens is invaluable.

## How Do I Use the WordNet list with Bonsai?

There are two ways to use WordNet with Bonsai. Users can add a subset of the list using the [Elasticsearch API](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html), or use the WordNet file that comes standard with all Bonsai clusters.

### Using WordNet via the API

Elasticsearch supports several synonym list formats, including the WordNet format. WordNet synonyms are maintained in a Prolog file called `wn_s.pl`. To use these in your cluster, you'll need to [download](http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz) the WordNet archive and extract the `wn_s.pl` file. You'll then need to create your synonyms list by reading this file into a request to your cluster.

The target index could be created with settings like so:

```json
PUT https://username:password@my-awesome-cluster.us-east-1.bonsai.io/some_index

{
  "settings": {
    "analysis": {
      "filter": {
        "wn_synonym_filter": {
          "type": "synonym",
          "format": "wordnet",
          "synonyms": [
            "s(100000001,1,'abstain',v,1,0).",
            "s(100000001,2,'refrain',v,1,0).",
            "s(100000001,3,'desist',v,1,0).",
            ... more synonyms, read from the wn_s.pl file
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "wn_synonym_filter"
          ]
        }
      }
    }
  }
}
```

There are a number of ways to generate this request. You could do it programmatically with a language like Python, with Bash scripts and curl, or with any language you're comfortable in.
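For instance, here is a minimal Python sketch of that workflow. It assumes the `requests` library, a local copy of `wn_s.pl`, and the placeholder cluster URL and index name from the example above; adjust all three for your own cluster:

```python
# A minimal sketch: read synset entries from wn_s.pl and create an index
# whose settings include a WordNet-format synonym filter.
import requests

# Placeholders -- swap in your own Bonsai credentials and index name.
CLUSTER_URL = "https://username:password@my-awesome-cluster.us-east-1.bonsai.io"
INDEX = "some_index"

# Read the Prolog synset entries. In practice you would usually trim this
# down to the synsets you actually need; the full file makes for a very
# large settings payload.
with open("wn_s.pl") as f:
    synonyms = [line.strip() for line in f if line.startswith("s(")]

settings = {
    "settings": {
        "analysis": {
            "filter": {
                "wn_synonym_filter": {
                    "type": "synonym",
                    "format": "wordnet",
                    "synonyms": synonyms,
                }
            },
            "analyzer": {
                "my_synonyms": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "wn_synonym_filter"],
                }
            },
        }
    }
}

# Create the index with the WordNet-backed synonym filter in its settings.
resp = requests.put(f"{CLUSTER_URL}/{INDEX}", json=settings)
print(resp.status_code, resp.text)
```

The important detail is that each element of the `synonyms` array is a raw `s(...)` line from `wn_s.pl`; beyond that, any language or tool that can send the request will do.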
A benefit of using a subset of the list is more control over your mappings and data footprint. Depending on whether your analyzer runs at index time or search time, you can save I/O by not computing unnecessary expansions for terms that never appear in your corpus or search parameters. Reducing that overhead improves performance overall.

### Using the WordNet list

If you would rather use the official WordNet list, it ships with every Bonsai cluster as part of our normal deployment process. You can follow the official [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html) for WordNet synonyms, and reference the file by setting `synonyms_path` to `analysis/wn_s.pl`. For example:

```json
PUT https://username:password@my-awesome-cluster.us-east-1.bonsai.io/some_index

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "whitespace",
            "filter": ["synonym"]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/wn_s.pl"
          }
        }
      }
    }
  }
}
```

## Resources

WordNet is a large subject and a great topic to delve deeper into. Here are some links for further reading:

- WordNet, a Princeton Project: https://wordnet.princeton.edu/wordnet/
- Download WordNet: https://wordnet.princeton.edu/wordnet/download/
- WordNet documentation: https://wordnet.princeton.edu/wordnet/documentation/
- WordNet in other languages: http://globalwordnet.org/
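Finally, whichever approach you take, you can sanity-check that the WordNet synonyms are actually being applied by running a term through the `_analyze` API (the request-body form shown here applies to Elasticsearch 5.x and later). This is a minimal sketch assuming the `some_index` index and `synonym` analyzer from the example above, with the same placeholder URL:

```python
# Sketch: verify the WordNet synonym filter by analyzing a single term.
import requests

# Placeholders -- these match the examples above, not a real cluster.
CLUSTER_URL = "https://username:password@my-awesome-cluster.us-east-1.bonsai.io"
INDEX = "some_index"

resp = requests.post(
    f"{CLUSTER_URL}/{INDEX}/_analyze",
    json={"analyzer": "synonym", "text": "kitty"},
)

# With the full WordNet list loaded, the tokens should include other members
# of the synset (e.g. "kitty-cat", "pussycat"), not just "kitty" itself.
for token in resp.json().get("tokens", []):
    print(token["token"])
```

If the response only contains the original token, double-check that the analyzer name matches your settings and that the synonym filter was actually included in them.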