Protocol

The Searsia protocol is a RESTful interface. Calls follow the API templates introduced before: they take a query and return JSON search results with three fields: "resource", containing the resource definition; "hits", containing the search results; and "searsia", containing the version number.

The resource definition tells Searsia how to get search results from an external source. Searsia copies resource definitions from other Searsia search engines, or reads them from files on the local file system.

This page explains how to configure Searsia by creating search engine definition files. Searsia can add search engines that follow Searsia's JSON format, as well as several other formats, including HTML.

Adding the main engine

We will start by adding the main engine by making a file index.json. The minimum information needed to add a search engine is:

  1. The engine's definition (called "resource"), which needs the engine's identifier ("id") that should match the file name, hence: "index" (excluding the file name extension).
  2. The Searsia version number ("searsia"), which will be v1.0.0.
  3. An API template ("apitemplate") for getting the search results, or alternatively, the search results themselves, called "hits". In this case, we add three hits, each containing only an engine identifier (called "rid" for "resource identifier"): "engine1", "engine2", and "engine3".

The file index.json contains:

{
  "resource": {
    "id": "index",
    "name": "My Engine"
  },
  "searsia": "v1.0.0",
  "hits": [
    { "rid": "engine1" },
    { "rid": "engine2" },
    { "rid": "engine3" }
  ]
}
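The three minimum requirements listed above can be checked mechanically before running the server. The following is a minimal sketch in Python, not part of Searsia itself; the function name check_definition is hypothetical, and the checks cover only what this page states.

```python
import json
from pathlib import Path


def check_definition(path, definition):
    """Return a list of problems found in a minimal engine definition."""
    problems = []
    resource = definition.get("resource")
    if not isinstance(resource, dict):
        problems.append('missing "resource"')
        resource = {}
    # The identifier should equal the file name without its extension.
    if resource.get("id") != Path(path).stem:
        problems.append('"id" does not match the file name')
    if "searsia" not in definition:
        problems.append('missing "searsia" version')
    # Either an API template or literal hits must be present.
    if "apitemplate" not in resource and "hits" not in definition:
        problems.append('need "apitemplate" or "hits"')
    return problems


index_json = """
{
  "resource": { "id": "index", "name": "My Engine" },
  "searsia": "v1.0.0",
  "hits": [ { "rid": "engine1" }, { "rid": "engine2" }, { "rid": "engine3" } ]
}
"""

definition = json.loads(index_json)
print(check_definition("index.json", definition))  # an empty list means no problems
```

Such a check only catches structural mistakes; the server's own -t test remains the authoritative check.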

We can test our engine definition index.json as follows:

java -jar searsiaserver.jar -m index.json -t json

which gives the following output:

Searsia server v1.0.2
Testing: index (My Engine)
{"hits": [ {"rid": "engine1"}, {"rid": "engine2"}, {"rid": "engine3"} ]}
Warning: less than 10 results for query: searsia; see "testquery" or "rerank".
Test succeeded.

So, our test succeeded. Instead of testing the JSON output of a single engine, we might also test all engines in the federated search engine as follows:

java -jar searsiaserver.jar -m index.json -t all

which results in 3 errors, because we have not yet added the engines engine1, engine2 and engine3.

Searsia server v1.0.2
Testing: index (My Engine)
Warning: less than 10 results for query: searsia; see "testquery" or "rerank".
Testing: engine1
Test failed: FileNotFoundException: engine1.json (No such file or directory)
Testing: engine2
Test failed: FileNotFoundException: engine2.json (No such file or directory)
Testing: engine3
Test failed: FileNotFoundException: engine3.json (No such file or directory)
ERROR: Test failed: 3 engines failed.

Let's add some more example engines below.

Adding a Searsia engine

To add an external Searsia resource (an engine that implements the Searsia protocol, returning results of mime-type application/searsia+json), add the engine's API template to the resource definition.

The file engine1.json contains:

{
  "resource": {
    "id": "engine1",
    "name": "Sheet Music",
    "favicon": "https://drsheetmusic.com/images/drsheetmusic-note.png",
    "apitemplate": "https://drsheetmusic.com/searsia/index.json?q={searchTerms?}&page={startPage?}"
  },
  "searsia": "v1.0.0"
}

The search engine definition may again be tested with:

java -jar searsiaserver.jar -m engine1.json -t json

Adding an HTML engine

Searsia can provide search results by scraping the HTML that search engines return for their end users. To add an HTML resource, use text/html in the field "mimetype". As the field "apitemplate", take the resource's URL from your browser after querying the search engine, replacing the query in the URL by {searchTerms}. If the URL does not contain a query, the search engine probably uses a POST request. To find out what the POST request is, we recommend Live HTTP Headers for Firefox. Put the POST string, with the query replaced by {searchTerms}, in the field "post". For a JSON post request, additionally use "postencode": "application/json".
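To make the role of the placeholders concrete, here is a minimal sketch in Python of how a template could be turned into a request URL. It is not Searsia's implementation: the function name fill_template is hypothetical, and both the URL-encoding and the handling of a trailing ? (treated as "optional", as in the OpenSearch URL template syntax) are assumptions.

```python
import re
from urllib.parse import quote_plus


def fill_template(template, values):
    """Replace {name} and {name?} placeholders with URL-encoded values."""

    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            return quote_plus(str(values[name]))
        if optional:
            return ""  # optional parameter without a value: leave it empty
        raise KeyError(f"no value for mandatory parameter {name!r}")

    return re.sub(r"\{(\w+)(\?)?\}", substitute, template)


template = "https://drsheetmusic.com/searsia/index.json?q={searchTerms?}&page={startPage?}"
print(fill_template(template, {"searchTerms": "moonlight sonata"}))
```

The same substitution would apply to a "post" string, since it uses the same curly-bracket parameters.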

Searsia uses XPath 1.0 to extract the search results from the web page. XPath is a query language for selecting elements of semi-structured data. If the search results are displayed as list elements, they are encoded as <li> ... </li> on the page, and they can be extracted with the XPath query //li. To find the most likely XPath query, we recommend Search Result Finder for Firefox. Fill in the XPath query in the field "itempath". To tell Searsia how to extract the components of the search result, add "extractors" for the fields title, description, link, or any other field you like (the client also supports image).

We recommend defining your HTML search engine incrementally, first adding a placeholder "itempath" XPath query, for instance //foobar. Add a test query "testquery" for which you are sure the search engine gives at least 10 results.

First attempt to incrementally define engine2.json:

{
  "resource": {
    "id": "engine2",
    "name": "Djoerd's Page",
    "apitemplate": "http://wwwhome.cs.utwente.nl/~hiemstra/?s={searchTerms}",
    "mimetype": "text/html",
    "testquery": "federated",
    "itempath": "//foobar"
  },
  "searsia": "v1.0.0"
}

Then test engine2.json with -t xml as follows. (This first attempt will not work, because the "itempath" is not yet correct, and the "extractors" that select the title, description and URL are still missing.)

java -jar searsiaserver.jar -m engine2.json -t xml

Tip: nicely format the XML with a tool like xmllint.

Because the API template and mimetype are correct, this test outputs the HTML search result page converted to XML. We can now see that the search results on this page use <h3> header tags. Each header tag contains the result's title (XPath extractor: .) and, in the href attribute of its <a> anchor tag, the URL (XPath extractor: ./a/@href); the first text node that follows the header is the description (XPath extractor: ./following-sibling::text()[1]).

After editing, the file engine2.json contains:

{
  "resource": {
    "id": "engine2",
    "name": "Djoerd's Page",
    "apitemplate": "http://wwwhome.cs.utwente.nl/~hiemstra/?s={searchTerms}",
    "mimetype": "text/html",
    "testquery": "federated",
    "itempath": "//h3",
    "extractors": {
      "title": ".",
      "url": "./a/@href",
      "description": "./following-sibling::text()[1]"
    }
  },
  "searsia": "v1.0.0"
}

Testing the engine with JSON output gives the following result:

java -jar searsiaserver.jar -m engine2.json -t json

Searsia server v1.0.2
Warning: Mother changed to http://wwwhome.cs.utwente.nl/~hiemstra/?s={searchTerms}
Testing: engine2 (Djoerd's Page)
{ "hits": [
  {
   "title": "Federated Search for Sheet Music",
   "description": "Dr. Sheet Music is a federated search engine for sheet music.",
   "url": "http://www.cs.utwente.nl/~hiemstra/2017/federated-search-for-sheet-music.html"
  },
(...)
]}
Test succeeded.
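To see what the item path and the three extractors of engine2 select, the sketch below mimics them with Python's standard xml.etree.ElementTree module. This is an approximation rather than Searsia's own XPath 1.0 evaluation: ElementTree supports only a subset of XPath, so the attribute and following-text extractors are expressed with get() and tail instead. The page fragment and its URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up page fragment, already well-formed XML as in the output of -t xml.
page = """
<html><body>
  <h3><a href="http://example.org/one.html">First result</a></h3>
  Summary of the first result.
  <h3><a href="http://example.org/two.html">Second result</a></h3>
  Summary of the second result.
</body></html>
""".strip()

root = ET.fromstring(page)
hits = []
for item in root.iterfind(".//h3"):  # corresponds to "itempath": "//h3"
    anchor = item.find("./a")
    hits.append({
        # "." : all text inside the item
        "title": "".join(item.itertext()).strip(),
        # "./a/@href" : the href attribute of the anchor
        "url": anchor.get("href") if anchor is not None else None,
        # "./following-sibling::text()[1]" : the text right after the item
        "description": (item.tail or "").strip(),
    })

for hit in hits:
    print(hit)
```

Each extractor is evaluated relative to one item, which is why the expressions start with a dot.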

Adding an XML or JSON resource

To add an XML resource, fill in application/xml in the field "mimetype". Then proceed as above. The Firefox Search Result Finder cannot be used in this case.

To add a JSON resource, fill in application/json in the field "mimetype". Searsia also uses XPath queries to interpret JSON output, by internally converting JSON to XML, where each JSON attribute name is converted to an XML element; JSON lists are converted to repeated XML elements with the JSON list's name. As above, test the resource with the -t xml switch for debugging.
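The conversion described above can be pictured with the following minimal sketch in Python. It follows only the two rules stated here (attribute names become elements, lists become repeated elements); the root element name "root", the function names, and the example data are assumptions, and Searsia's actual converter may differ in details.

```python
import json
import xml.etree.ElementTree as ET


def add_value(parent, name, value):
    """Append `value` to `parent` as one or more elements called `name`."""
    if isinstance(value, list):
        for entry in value:  # a list becomes repeated elements
            add_value(parent, name, entry)
        return
    element = ET.SubElement(parent, name)
    if isinstance(value, dict):
        for key, child in value.items():  # each attribute name becomes an element
            add_value(element, key, child)
    elif value is not None:
        element.text = str(value)


def json_to_xml(text, root_name="root"):
    """Convert a JSON object (given as text) to an XML element tree."""
    root = ET.Element(root_name)
    for key, value in json.loads(text).items():
        add_value(root, key, value)
    return root


response = """
{"hits": [
  {"title": "Campus map", "url": "http://example.org/map"},
  {"title": "A result without a link"}
]}
"""

root = json_to_xml(response)
print(ET.tostring(root, encoding="unicode"))
# A predicate such as [url] keeps only the elements that have a url child.
print([hit.findtext("title") for hit in root.findall(".//hits[url]")])
```

Because the list "hits" turns into repeated hits elements, an item path can select them just like repeated elements in an HTML page.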

The following example selects only those hits from a JSON search engine (actually, a search engine that produces Searsia's JSON) that have a URL. This definition selects part of the search results, because a Searsia engine might also show results without a URL, and it might show more fields than the ones selected here.

The file engine3.json contains:

{
  "resource": {
    "id": "engine3",
    "name": "UT Search",
    "apitemplate": "https://search.utwente.nl/searsia/index?q={searchTerms}",
    "mimetype": "application/json",
    "testquery": "campus",
    "itempath": "//hits[./url]",
    "extractors": {
      "title": "./title",
      "url": "./url",
      "description": "./description"
    }
  },
  "searsia": "v1.0.0"
}

Now we can test all engines as follows:

java -jar searsiaserver.jar -m index.json -t all

Searsia server v1.0.2
Testing: index (My Engine)
Warning: less than 10 results for query 'searsia'; see "testquery" or "rerank".
Testing: engine1 (Sheet Music)
Testing: engine2 (Djoerd's Page)
Testing: engine3 (UT Search)
Test succeeded.

Examples

Searsia supports many APIs by including API keys as secret parameters that will not be shared, as well as the possibility to add custom HTTP headers. See the University of Twente Searsia API (the API of UT Search) for examples of Searsia's resource configurations, including several examples that use HTML scrapers, and examples for accessing the APIs of Google, Twitter, Facebook, Flickr, Instagram, and more. If you believe that Searsia is unable to get search results from an existing resource that should be supported, please post your question under Searsia Server Issues. Please note that Searsia is not meant to scrape sites that do not want to be scraped, and therefore contains no way to circumvent, for instance, session cookies.

While the Searsia resource configurations provide a way to get the search results for a great variety of existing search engines, Searsia also provides a flexible way to structure the search results from these engines. The search results, i.e., the objects in the "hits" list, may contain any attribute that seems appropriate, for instance an attribute "phone_number" for a telephone directory or an attribute "nr_of_citations" for a search engine that searches scientific papers. The following attributes are reserved:

  • "title": The title of the search result, which can be clicked to go to the web page that was found. Usually, the title equals the title of the web page that was found. The title is the only attribute that is mandatory.
  • "url": The link to the web page that was found.
  • "description": A small summary describing the result. This might be a snippet from the web site containing the query, or some other summary.
  • "image": The url of a (thumbnail) image, to be displayed with the search result.

The Results demo mockup below shows 7 ways to present the search results from Wikipedia's search suggestions; that is, the mockup shows the same search results 7 times using different configurations.

The 7 results presentations are achieved as follows:

  • wikididyoumean, whose name is "Did you mean:", returns a single search result that contains the title as well as "tags":"#suggestion", which tells the client to display the result as a query suggestion.
  • wikifull (Wikipedia Pages) returns title, description and URL. In this configuration, the domain of "urltemplate" (wikipedia.net) does not match the results' domains (wikipedia.org). Therefore, the client displays the URLs for each aggregated result. The "urltemplate" is the url that the user will use to search on the site, whereas the "apitemplate" will be used by the server.
  • wikiimage (Wikipedia Images) returns title and image and "tags":"#image", which tells the client to display the results as an image result. Note how the XPath functions concat() and substring-after() are used to create a custom image URL.
  • wikismall2, which is called Search Wikipedia for, does not return URLs, which makes the client infer the URL from the "title" and the "urltemplate", effectively creating a search engine that spawns a search on Wikipedia.
  • wikifull2 (Wikipedia Again) returns full search results with a thumbnail image, much like the Wikipedia Pages engine above.
  • wikismall (Wikipedia Small) returns search results with "tags":"#small", telling the client to display the results on a single line.
  • wikirelated is called Related searches:. It returns only the titles and "tags":"#small". The header Related searches: cannot be clicked because the resource does not configure the "urltemplate" for the end user.
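The way a client might infer a link for a result without a URL, as described for wikismall2 and wikirelated, can be sketched as follows in Python. This is an assumption about the behaviour, not the client's actual code; the function name result_link and the template URL are hypothetical.

```python
from urllib.parse import quote_plus


def result_link(hit, urltemplate):
    """Return the hit's own URL, or a search URL built from its title."""
    if hit.get("url"):
        return hit["url"]
    if urltemplate is None:
        return None  # nothing to link to, so the result cannot be clicked
    return urltemplate.replace("{searchTerms}", quote_plus(hit["title"]))


urltemplate = "http://example.org/search?q={searchTerms}"  # hypothetical
print(result_link({"title": "Federated search"}, urltemplate))
print(result_link({"title": "Federated search"}, None))
```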

Note that each configuration uses the same "apitemplate": each of the 7 results effectively uses the exact same search engine. Please do not use this example in an actual server configuration: the configuration of the mockup, if used in an actual Searsia Server, would send each query 7 times(!) to Wikipedia.

An overview of all resource fields

The table below contains a quick reference for all fields that are supported inside the "resource" field:

Parameter  Explanation
"apitemplate" *  A URL specified following the Searsia URL template syntax, to be used by the server.
"deleted"  A boolean. Value: true (no quotes) if the resource is deleted (only used in files; the server will use HTTP status 410, Gone).
"extractors"  Field names and XPath queries for selecting parts of search results such as the title, url, and description. For example, the title would typically be selected as the first anchor text in a search result, i.e., (.//a)[1]. The XPath queries are evaluated with respect to an Item XPath context node, and they typically start with . (the 'self' axis step).
"favicon"  The URL of the icon image, to be displayed to the user. Icons should have equal width and height. Icons are preferably png files, not smaller than 48x48 pixels.
"headers"  HTTP headers to be sent to the API Template, consisting of a field name (without ':') and the field value. Like API Template and Post String, the field value may include parameters.
"id" *  A unique string identifying this resource. Should be locally unique (within the server).
"itempath"  An XPath 1.0 query that selects the search results from an HTML or XML result. XPath is also used to select results from JSON results, assuming a standard conversion to XML, where JSON lists are converted to repeated XML elements.
"maxqueriesperday"  The maximum number of queries per day that may be sent to the search engine. If the maximum per day is reached, the engine gives an HTTP status 503, Service Unavailable. Default: 1000.
"mimetype"  The format returned by the API Template. Supported formats are: application/searsia+json, text/html, application/xml, application/json. If omitted, the mime type application/searsia+json is assumed.
"name"  A short name for this resource, to be displayed to the user.
"post"  Only set if the API template HTTP method is POST, empty if the HTTP method is GET. Like the template, the post string may include parameters, delimited by curly brackets.
"postencode"  Will encode the post string using a certain mime-type. Possible values: application/x-www-form-urlencoded (default) or application/json.
"prior"  A value between 0 and 1 indicating the prior score of a resource to be selected (prior to knowing the query). This score is added to the score that is computed for the result's match with the query. (A number, so no quotes.)
"privateparameters"  Secret parameter names and their values. Occurrences of the parameter names in the API Template, Post String or Headers will be replaced by the value. The server will not share the parameter names and values with other clients, so it is safe to use them for API keys and secrets.
"rerank"  Specifies a ranking algorithm or filter that is used to rerank/refilter the search results. Currently, random (randomly reorder the returned results), best (reorder by best match of query with result title and description), and bestrandom (reorder by best match, supplemented with random results if less than 10 matching results are found) are supported. All other strings are treated as best.
"signature"  Used to sign API requests. Takes a function and optionally a private parameter. Possible value: HmacSHA256({key}) (used for Amazon's affiliate program).
"testquery"  A query for which a search gives a non-empty result. If not set, the system should give a non-empty result for the query searsia.
"urltemplate"  A URL specified following the OpenSearch URL template syntax, to be used by users. The mime type of this URL must be text/html or application/xhtml+xml.

* mandatory fields
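The interplay of "privateparameters" with the API template and headers can be sketched as follows in Python. This is a hedged illustration, not Searsia's code: it assumes that both "headers" and "privateparameters" are JSON objects mapping names to values, and that the shared copy of a resource simply omits the secrets. The parameter name apikey, the URLs, and the helper apply_private_parameters are made up.

```python
def apply_private_parameters(text, private):
    """Replace each {name} in `text` by its secret value."""
    for name, value in private.items():
        text = text.replace("{" + name + "}", value)
    return text


# Hypothetical resource definition; "apikey" and the URLs are made up.
resource = {
    "id": "example",
    "apitemplate": "http://example.org/api?q={searchTerms}&key={apikey}",
    "headers": {"Authorization": "Bearer {apikey}"},
    "privateparameters": {"apikey": "SECRET"},
}

private = resource["privateparameters"]
request_url = apply_private_parameters(resource["apitemplate"], private)
request_headers = {
    field: apply_private_parameters(value, private)
    for field, value in resource["headers"].items()
}
# What may be shared with other clients: everything except the secrets.
shareable = {key: value for key, value in resource.items() if key != "privateparameters"}

print(request_url)
print(request_headers)
print(shareable)
```

Note that the query placeholder {searchTerms} is left untouched by this step; it is filled in per query.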

Troubleshooting

If your error is: SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

Your Java installation does not have the right SSL certificate to connect to the site. Updating Java might solve this problem. If not, get the certificate in Firefox by clicking on the lock in the browser address bar. Choose "More Information", "View Certificate", "General Details", "Export" and save the certificate (for instance as MySite.crt). Then, on your machine, locate your Java certificate store; on Ubuntu 18.04, it can be found here: /usr/lib/jvm/default-java/lib/security/cacerts. Finally, update your certificates as follows (use the password: changeit):

sudo keytool -import -noprompt -trustcacerts -alias MySite -file MySite.crt -keystore /usr/lib/jvm/default-java/lib/security/cacerts