Using WordNet with Bonsai

Learn how to match synonyms in your searches using WordNet and Elasticsearch.

Let's say that you have an online store with a lot of products. You want users to be able to search for those products, but you want that search to be smart. For example, say that your user searches for "bowtie pasta." You may have a product called "Funky Farfalle" which is related to their search term but which would not be returned in the results because the title has "farfalle" instead of "bowtie pasta". How do you address this issue?

Solr has a mechanism for defining custom synonyms, through the [SynonymFilterFactory](https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory). This lets search administrators define groups of related terms and even corrections to commonly misspelled terms. A typical synonyms.txt file might look like this:
[block:code]
{
  "codes": [
    {
      "code": "i-pod, i pod => ipod\nfeline,kitten,cat,kitty",
      "language": "text"
    }
  ]
}
[/block]
This is great for solving the proximate issue, but it can get extremely tedious to define all groups of related words in your index.

# What is WordNet?

WordNet has been described as a [lexical database](https://en.wikipedia.org/wiki/Lexical_database) by its creators. Essentially, it is a text database which places English words into synsets (groups of synonyms) and can be considered something of a cross between a dictionary and a thesaurus. An entry in WordNet looks something like this:

`s(102122298,1,'kitty',n,4,0).`

Let's break it down:
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/77fe711-synset-breakdown.png",
        "synset-breakdown.png",
        1200,
        442,
        "#ddeef8"
      ]
    }
  ]
}
[/block]
This line expresses that the word 'kitty' is a noun, and the first word in synset 102122298 (which includes other terms like "kitty-cat," "pussycat," and so on). The `4` is the sense number: this is the fourth sense of 'kitty' as a noun. The final `0` is the tag count, the number of times this sense was tagged in the semantic concordance texts.
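
To make the breakdown concrete, here is a minimal Python sketch that splits one entry into its fields. The field names follow the WordNet Prolog documentation as best understood; `SynsetEntry`, `parse_entry`, and the regular expression are illustrative names introduced here, not part of WordNet, Elasticsearch, or Bonsai. The sketch assumes the single-quoted form used in `wn_s.pl` itself, and assumes that an apostrophe inside a word is written as two single quotes.

```python
import re
from typing import NamedTuple

# Pattern for one line of wn_s.pl, e.g. s(102122298,1,'kitty',n,4,0).
# Assumption: an apostrophe inside a word is escaped as two single quotes ('').
ENTRY = re.compile(
    r"^s\((\d+),(\d+),'((?:[^']|'')*)',([a-z]),(\d+),(\d+)\)\.$"
)


class SynsetEntry(NamedTuple):
    synset_id: int     # identifies the synset the word belongs to
    w_num: int         # position of the word within that synset
    word: str
    ss_type: str       # n = noun, v = verb, a = adjective, s = adjective satellite, r = adverb
    sense_number: int  # which sense of the word this is, within its part of speech
    tag_count: int     # times this sense was tagged in the concordance texts


def parse_entry(line: str) -> SynsetEntry:
    """Parse a single s(...) line; raise ValueError if it does not match."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a WordNet synonym entry: {line!r}")
    synset_id, w_num, word, ss_type, sense_number, tag_count = match.groups()
    return SynsetEntry(
        int(synset_id), int(w_num), word.replace("''", "'"),
        ss_type, int(sense_number), int(tag_count),
    )


entry = parse_entry("s(102122298,1,'kitty',n,4,0).")
print(entry.word, entry.ss_type, entry.synset_id)  # prints: kitty n 102122298
```

Grouping parsed entries by `synset_id` would recover each synset as a list of words, which is essentially the grouping a synonym filter needs.
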
You can read more about the structure and precise definitions of WordNet entries in the [documentation](https://wordnet.princeton.edu/wordnet/man/prologdb.5WN.html).

WordNet has become extremely useful in text processing applications, including data storage and retrieval. Some use cases require features like synonym processing, for which a lexical grouping of tokens is invaluable.

## How Do I Use the WordNet list with Bonsai?

Bonsai allows users to create custom lists of synonyms via the [Elasticsearch API](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html). Elasticsearch supports several different list formats, including the WordNet format. Bonsai users can add the WordNet synonyms list programmatically; however, there is no mechanism to upload a file of WordNet lists to a cluster.

WordNet synonyms are maintained in a Prolog file called `wn_s.pl`. To use these in your cluster, you'll need to [download](http://wordnetcode.princeton.edu/3.0/WordNet-3.0.tar.gz) the WordNet archive and extract the `wn_s.pl` file. You'll then need to create your synonyms list by reading this file into a request to your cluster.

The target index could be created with settings like so. The `synonyms` array is truncated here to three entries; a full request would list every entry read from the `wn_s.pl` file:
[block:code]
{
  "codes": [
    {
      "code": "PUT https://username:password@my-awesome-cluster.us-east-1.bonsai.io/some_index\n\n{\n  \"settings\": {\n    \"analysis\": {\n      \"filter\": {\n        \"wn_synonym_filter\": {\n          \"type\": \"synonym\",\n          \"format\": \"wordnet\",\n          \"synonyms\": [\n            \"s(100000001,1,'abstain',v,1,0).\",\n            \"s(100000001,2,'refrain',v,1,0).\",\n            \"s(100000001,3,'desist',v,1,0).\"\n          ]\n        }\n      },\n      \"analyzer\": {\n        \"my_synonyms\": {\n          \"tokenizer\": \"standard\",\n          \"filter\": [\n            \"lowercase\",\n            \"wn_synonym_filter\"\n          ]\n        }\n      }\n    }\n  }\n}",
      "language": "json"
    }
  ]
}
[/block]
There are a number of ways to generate this request.
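
As one illustration, the request body could be assembled along these lines. This is a minimal sketch, not an official Bonsai tool: `build_index_settings` is a hypothetical helper, the filter and analyzer names are copied from the example settings above, and the three sample lines stand in for the real contents of `wn_s.pl` (written in the single-quoted form that file uses).

```python
import json
from typing import Iterable


def build_index_settings(lines: Iterable[str]) -> dict:
    """Build the index-creation body, using every s(...) line as a synonym."""
    # Keep only synonym entries; blank lines and anything else are skipped.
    synonyms = [line.strip() for line in lines if line.startswith("s(")]
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "wn_synonym_filter": {
                        "type": "synonym",
                        "format": "wordnet",
                        "synonyms": synonyms,
                    }
                },
                "analyzer": {
                    "my_synonyms": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "wn_synonym_filter"],
                    }
                },
            }
        }
    }


# A three-line stand-in for the contents of wn_s.pl (illustrative data).
sample = [
    "s(100000001,1,'abstain',v,1,0).\n",
    "s(100000001,2,'refrain',v,1,0).\n",
    "s(100000001,3,'desist',v,1,0).\n",
]

# json.dumps handles all quoting and escaping of the entries.
body = json.dumps(build_index_settings(sample), indent=2)
print(body)
```

To use the real list, the same function can be given an open file handle for `wn_s.pl` instead of `sample`, and the resulting string sent as the body of the `PUT` request with whatever HTTP client you prefer. Building the body with a JSON library rather than by string concatenation avoids quoting mistakes in the thousands of entries.
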
You could do it programmatically with a language like Python, with Bash scripts and curl, or with any other language you feel comfortable with.

## Resources

WordNet is a large subject and a great topic to delve deeper into. Here are some links for further reading:

- WordNet, a Princeton Project: https://wordnet.princeton.edu/wordnet/
- Download WordNet: https://wordnet.princeton.edu/wordnet/download/
- WordNet Documentation: https://wordnet.princeton.edu/wordnet/documentation/
- WordNet in other languages: http://globalwordnet.org/