Three ways to write the same query

Let’s say you want to find all patents published in the last 10 years that have the word “dog” in their titles or abstracts, and whose assignees are located in either the US or Canada. Here are three ways you could write such a query:

  1. Use a string:
query_v_1 <-
  '{"_and":[
          {"_gte":{"patent_date":"2007-03-01"}},
          {"_or":[
            {"_text_all":{"patent_title":"dog"}},
            {"_text_all":{"patent_abstract":"dog"}}
          ]},
          {"_or":[
            {"_eq":{"assingee_country":"US"}},
            {"_eq":{"assingee_country":"CA"}}
          ]}
  ]}'
  1. Use a list:
query_v_2 <- 
  list("_and" = 
       list(
          list("_gte" = list(patent_date = "2007-03-01")),
          list("_or" = 
                 list(
                   list("_text_all" = list(patent_title = "dog")),
                   list("_text_all" = list(patent_abstract = "dog"))
                   )
               ),
          list("_or" = 
                 list(
                   list("_eq" = list(assingee_country = "US")),
                   list("_eq" = list(assingee_country = "CA"))
                   )
               )
      )
  )
  1. Use the patentsview domain specific language (DSL):
library(patentsview)

query_v_3 <- 
  with_qfuns(
    and(
      gte(patent_date = "2007-03-01"),
      or(
        text_all(patent_title = "dog"),
        text_all(patent_abstract = "dog")
      ),
      eq(assingee_country = c("US", "CA"))
    )
  )

Why use the DSL?

We can see that all three versions of the query shown above are equivalent:

jsonlite::minify(query_v_1)
#> {"_and":[{"_gte":{"patent_date":"2007-03-01"}},{"_or":[{"_text_all":{"patent_title":"dog"}},{"_text_all":{"patent_abstract":"dog"}}]},{"_or":[{"_eq":{"assingee_country":"US"}},{"_eq":{"assingee_country":"CA"}}]}]}
jsonlite::toJSON(query_v_2, auto_unbox = TRUE)
#> {"_and":[{"_gte":{"patent_date":"2007-03-01"}},{"_or":[{"_text_all":{"patent_title":"dog"}},{"_text_all":{"patent_abstract":"dog"}}]},{"_or":[{"_eq":{"assingee_country":"US"}},{"_eq":{"assingee_country":"CA"}}]}]}
jsonlite::toJSON(query_v_3, auto_unbox = TRUE)
#> {"_and":[{"_gte":{"patent_date":"2007-03-01"}},{"_or":[{"_text_all":{"patent_title":"dog"}},{"_text_all":{"patent_abstract":"dog"}}]},{"_or":[{"_eq":{"assingee_country":"US"}},{"_eq":{"assingee_country":"CA"}}]}]}

…So why would you ever want to use method 3 over methods 1 or 2? There are two main reasons:

1. Query validation

search_pv() will check your query for errors if you use methods 2 or 3. This is not the case for method 1, where you would have to rely on the API’s error messages for guidance if your query is invalid. search_pv() checks queries for the following:

  • The fields included in your query are queryable for the endpoint (i.e., the field can be used in the user query). For example, it would make sure that assingee_country can be used in the query argument if you sent the above query to the patents endpoint.
  • The fields in your query are compatible with the comparison operators you used. For example, it would confirm that the text_all operator was used with a field whose type was “full text” (patent_title above).
  • You supplied the correct value type for the field (e.g., patent_date is a character, not an integer).

2. Concise, easy to use syntax for complex queries

Methods 1 and 3 are both shorter than method 2, making them quicker. It’s also a lot easier to get the JSON syntax correct when using method 3 compared to method 1, because you don’t have to write any JSON at all using the DSL…This is important because the API is fairly picky about the query syntax, so it’s not trivial to get it correct. For example, the API will throw an error if you use a box in your JSON when is not absolutely necessary, even if your query is still valid JSON (e.g., query = {"_gte":{"patent_date":["2007-03-01"]}} will throw an error).

Compared to method 1, method 3 will correctly “or” together values if you put them in a vector. For example, in the query shown above, a vector of two values was given for assingee_country (c("US", "CA")). This safely converted the single “equals” statement in the third element of the query (eq(assingee_country = c("US", "CA"))) to two separate equals statements that got or’d together.1

Basics of the language

All of the functions that make up the DSL are found in the qry_funs list (e.g., qry_funs$eq()). You can evaluate code in the context of this list using the function with_qfuns() (see ?with_qfuns() for an example that demonstrates how with_qfuns() can save you typing). There are three types of functions in qry_funs:

  1. Comparison operator functions (eq, neq, gt, gte, lt, lte, begins, contains, text_all, text_any, text_phrase). These functions are used to compare a field to a value. For example, using the “less than or equal to” function (lte), we can filter out patents published after some date (e.g., query = qry_funs$lte(patent_date = "2001-01-05")). See the “comparison operators” section of the API’s query language page for a description of the 11 comparison operators. One important thing to keep in mind is that certain comparison operators only work with certain data types. For example, you can’t use the begins function on patent_abstract because patent_abstract is of data type “full text” and begins only works with fields of data type “string.”
  2. Array functions (and and or). You can use these functions to logically combine your calls to the comparison operators. For example, we can require that the patent date is less than or equal to 2001-01-05 and the inventor’s last name is “Ihaka” (query = with_qfuns(and(lte(patent_date = "2001-01-05"), eq(inventor_last_name = "Ihaka")))).
  3. not function (not). This function negates a comparison. For example, we could search for patents that don’t have the word “hi” in their titles like this: qry_funs$not(qry_funs$text_phrase(patent_title = "hi")).

Query examples

The following queries are intended for the patents endpoint

Patents linked to an assignee with 10 or fewer distinct (and disambiguated) inventors:

qry_funs$lte(assignee_total_num_inventors = 10)
#> {"_lte":{"assignee_total_num_inventors":10}}

Patents assigned to the “CPC subsection”2 of G12 (physics instruments):

qry_funs$eq(cpc_subsection_id = "G12")
#> {"_eq":{"cpc_subsection_id":"G12"}}

Patents that:

  • Have an inventor listed on them whose first name contains “joh” AND
  • Have an abstract containing either the phrase “dog bark” or “cat meow” AND
  • Have an abstract that doesn’t have the phrase “dog chain” in it:
with_qfuns(
  and(
    contains(rawinventor_first_name = "joh"),
    text_phrase(patent_abstract = c("dog bark", "cat meow")),
    not(
      text_phrase(patent_abstract = c("dog chain"))
    )
  )
)
#> {"_and":[{"_contains":{"rawinventor_first_name":"joh"}},{"_or":[{"_text_phrase":{"patent_abstract":"dog bark"}},{"_text_phrase":{"patent_abstract":"cat meow"}}]},{"_not":{"_text_phrase":{"patent_abstract":"dog chain"}}}]}

Patents that:

  • Have an inventor listed on them whose last name is “Smith” AND
  • Have “cotton gin” in their title

OR

  • Have an inventor listed on them whose last name is “Hopper” AND
  • Have “COBOL” in their title
with_qfuns(
  or(
    and(
      eq(inventor_last_name = "smith"),
      text_phrase(patent_title = "cotton gin")
    ),
    and(
      eq(inventor_last_name = "hopper"),
      text_phrase(patent_title = "COBOL")
    )
  )
)
#> {"_or":[{"_and":[{"_eq":{"inventor_last_name":"smith"}},{"_text_phrase":{"patent_title":"cotton gin"}}]},{"_and":[{"_eq":{"inventor_last_name":"hopper"}},{"_text_phrase":{"patent_title":"COBOL"}}]}]}

  1. One may note that using “value arrays” is supposedly supported by the API. For example, the API documentation gives the following query as an example of their use: '{"inventor_last_name":["Whitney","Hopper"]}'. The problem with this is that the API is not consistent in its handling of value arrays. For many of the comparison operators, one cannot “or” together values like this using arrays. Thus, the DSL in patentsview never relies on arrays when creating queries.

  2. PatentsView gets the names of the CPC hierarchy wrong. For example, a “CPC subsection” according to PatentsView is actually a CPC class.