Boolean Searching

Boolean search is a powerful method available in search to help you find the needle in the haystack, by improving the relevancy of the object records returned from a query. But boolean search can escalate in complexity once you start combining its operators, and the subtle interactions between them can become challenging to predict for even those with the most logical of minds.

The boolean operators available are:

  • Phrase Searching - e.g. ‘“William Morris”’

  • Negated Searching - e.g. ‘-silver’

  • AND/OR searching - ‘gelato rome’ (AND), ‘gelato|rome’ (OR)

  • Fuzzy searching (booleanish) - ‘graffito~2’

These are not new query parameters, they are passed within the ‘q’ parameter value used for general text search (so they will search all fields)

Phrase searching

Perhaps the most simple boolean operator to understand, phrase searching allows you to specify that you want an exact phrase (a word or a sequence of words) to be matched, exactly as written. Normally if you search for multiple words, any occurance of the words in isolation anywhere in the record will be considered a match. For example, if you want to search for any objects that contain the words ‘William Morris’, you may not want to see object records mentioning the (hyperthetical) partnership of “William Smith” and “Jane Morris” but these would be returned, as they have the words ‘William’ and ‘Morris’ somewhere within them. Instead, if you search using double quotes around the words (‘“William Morris”’), you will now see a smaller number of results that have that exact phrase in it.

Note

Of course, if you really want to only find objects connected to the Victorian Artist & Designer William Morris, you would be better use a filter either on the person, or on the maker if you are only interested in objects he was involved in creating, see filtering:top) using the controlled vocab. identifier for William Morris (A8676).


Example of non-Phrase searching

import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=William Morris')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the words 'William' and 'Morris' somewhere in the record")
There are 2436 objects that have the words 'William' and 'Morris' somewhere in the record

Example of Phrase searching

import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q="William Morris"')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f'There are {record_count} objects that have the phrase "William Morris" somewhere in the record')
There are 2239 objects that have the phrase "William Morris" somewhere in the record

Negated Searching

Also fairly simple to understand, negated search allows you to exclude object records you do not want to see, even though they would match otherwise. This is most commonly used in tandem with another operator or with a query, as otherwise you are likely to return a very large number of results (for example returning all records that do not mention pineapples would be over a million records and only of use to people with an irrational fear of pineapple).

You can apply negation to single words in a text search (e.g. find object records that contain the word ‘blue’ but do not contain the word ‘azure’), to phrase search (as described above) and to Filtering (e.g. find records that are not associated with the maker William Morris )

All you need to do apply negation is to prefix the word, phrase or filter term with a hyphen ‘-‘ (there must not be a space between the hyphen and the word/phrase/filter). Some examples of it in usage:

import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=blue')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the word blue somewhere in the record")
There are 62861 objects that have the word blue somewhere in the record
import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=blue azure')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the words blue and azure somewhere in the record")
There are 128 objects that have the words blue and azure somewhere in the record
import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=blue -azure')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the word 'blue' and not the word 'azure' somewhere in the record")
There are 62733 objects that have the word 'blue' and not the word 'azure' somewhere in the record

AND/OR Searching

One of the key issues in searching within a large set of data is whether you want to have a broad set of results, some of which might be correct (and some not), or whether you want a smaller set of results which are mostly correct (but may be missing out on some other correct results).

Using AND or OR in your search lets you decide which of these approaches to take, by indicating when you search for multiple words whether you want to find object records that contain all the search words (AND) , or if you are happy as well to see object records that contain only some of the words (OR). Up until now, without even telling you, we have been defaulting your searches to use AND, so only object records that match on all the terms are returned. But you can switch this to OR by using the ‘|’ operator between the search words.

One particular use for OR searching is searching on variations of a topic, allowing you to find mentions of either or both words in one search query. For example if you want to find all records that must mention ‘photography’ and one or both of ‘travel’ and ‘landscape’ you could run two AND queries:

Examples of AND searching

import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=travel photography')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the words 'travel' and 'photography' somewhere in the record")
There are 12696 objects that have the words 'travel' and 'photography' somewhere in the record
import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=landscape photography')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the words 'landscape' and 'photography' somewhere in the record")
There are 6693 objects that have the words 'landscape' and 'photography' somewhere in the record

and then you can combine the results (removing the duplicates when a record would have matched both queries, as it contains all three words). A faster way to acheieve the same thing though is one OR query

Example of OR searching

import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=(travel|landscape) photography')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the words 'travel' and/or 'landscape' and 'photography' somewhere in the record")
There are 14426 objects that have the words 'travel' and/or 'landscape' and 'photography' somewhere in the record

which then returns all the records that:

* Mention 'travel' and 'photograpy'
* Mention 'landscape' and 'photography'
* Mention 'travel' and 'landscape' and 'photography'

(The number for the OR query is not the sum of the first two AND queries, as object records that mention all three words will be duplicated in the results of the first two AND queries)

Fuzzy Searching

This is not strictly a boolean operator, but it’s close enough to be considered so and can be combined with boolean operators. It lets you run a search query with a certain degree of variation allowed in matching, allowing you to find records that might have slightly different spellings across times & cultures, dealing with singular and plural, catching typos, etc. For example, you might want to find any variations of how Shakespeare mat have been spelt:

import requests

req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=Shakespeare')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the word 'Shakespeare' somewhere in the record")
There are 5735 objects that have the word 'Shakespeare' somewhere in the record
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=Shakespeare~1')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the word 'Shakespeare' (with a fuzzy distance of 1 character change allowed) somewhere in the record")
There are 5735 objects that have the word 'Shakespeare' (with a fuzzy distance of 1 character change allowed) somewhere in the record
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=Shakespeare~2')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the word 'Shakespeare' (with a fuzzy distance of 2 character changes allowed) somewhere in the record")
There are 5838 objects that have the word 'Shakespeare' (with a fuzzy distance of 2 character changes allowed) somewhere in the record

As you can see from the increasing numbers, as you increase the fuzzy distance more records are returned. So for example the 2nd search would also include ‘Shakespere’ (one character removed), the 3rd search would also include ‘Shakspere’ (two characters removed).

Of course, the danger is as you increase the fuzzy distance, the more likely irrelevant results are returned. So equally valid for the 3rd search would be matches on ‘Wakespear’, ‘Zakespeare’, ‘Rhakezpeare’ and so on. The shorter the word, the less useful fuzzy searching is, as too many irrelevant results are likely to be returned.

Boolean Operators in Combination

By combining boolean operators together you can emulate filtering operations, returning only exact matches from a query, but these results can sometimes be hard to interpret.

For example, you might want to find:

  • All records mentioning ‘Yorkshire’ (or minor variations on the spelling to within a 2 character distance inclusive)

  • But not if they mention ‘Scarborough’

  • But they must also mention ‘chair’

Let’s build this up step-by-step.

req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=Yorkshire~2')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} object records that mention Yorkshire (within a 2 character distance) somewhere in the record")
There are 7145 object records that mention Yorkshire (within a 2 character distance) somewhere in the record
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=Yorkshire~2 -Scarborough')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} object records that mention Yorkshire (within a 2 character distance) without mentioning Scarborough somewhere in the record")
There are 7091 object records that mention Yorkshire (within a 2 character distance) without mentioning Scarborough somewhere in the record
import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=Yorkshire~2 -Scarborough chair')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} object records that mention Yorkshire (within a 2 character distance) without mentioning Scarborough and also mention a chair")
There are 4147 object records that mention Yorkshire (within a 2 character distance) without mentioning Scarborough and also mention a chair

Note

If you are using negation and boolean OR operators together, the order they are written may affect your results. For a very detailed investigation of this see https://github.com/elastic/elasticsearch/issues/4707`