The Searsia protocol is a RESTful interface. The calls follow the API templates introduced earlier, which take a query and return JSON search results with three fields: "resource", containing the resource definition; "hits", containing the search results; and "searsia", containing the version number.
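As a rough illustration of the three-field response described above, the following Python sketch splits such a response into its parts. The sample response text and the helper name `split_response` are made up for this example and are not part of Searsia.

```python
import json

# Example response shaped like the three-field format described above.
# The field values are invented for illustration.
response_text = """
{
  "resource": {"id": "index", "name": "My Engine"},
  "hits": [{"rid": "engine1"}, {"rid": "engine2"}],
  "searsia": "v1.0.0"
}
"""

def split_response(text):
    """Return (resource, hits, version) from a Searsia-style JSON string."""
    data = json.loads(text)
    missing = [key for key in ("resource", "hits", "searsia") if key not in data]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return data["resource"], data["hits"], data["searsia"]

resource, hits, version = split_response(response_text)
print(resource["id"], len(hits), version)
```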
The resource definition tells Searsia how to get search results from an external source. Searsia copies resource definitions from other Searsia search engines, or from files on the local file system.
This page explains how to configure Searsia by creating search engine definition files. Searsia can add search engines that follow Searsia's JSON format, as well as several other formats, including HTML.
We will start by adding the main engine by creating a file index.json.
The minimum information needed to add a search engine is:
- The resource definition ("resource"), which needs the engine's identifier ("id"). The identifier should match the file name without its extension, hence: "index".
- The Searsia version ("searsia"), which will be v1.0.0.
- The API template ("apitemplate") for getting the search results, or alternatively the search results themselves, called "hits". In this case, we add three hits, each containing only an engine identifier (called "rid" for "resource identifier"): "engine1", "engine2", and "engine3".
The file index.json contains:
{ "resource": { "id": "index", "name": "My Engine" }, "searsia": "v1.0.0", "hits": [ { "rid": "engine1" }, { "rid": "engine2" }, { "rid": "engine3" } ] }
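Before running the server's own test, the minimum requirements listed above can be checked by hand. The sketch below is one possible way to do so in Python; the function `check_definition` is hypothetical and only mirrors the rules stated in this section, not the checks the Searsia server performs.

```python
import json
from pathlib import Path

def check_definition(path, text):
    """Return a list of problems found in a definition file's JSON text."""
    data = json.loads(text)
    problems = []
    resource = data.get("resource", {})
    # The "id" should equal the file name without its extension.
    if resource.get("id") != Path(path).stem:
        problems.append('"id" should equal the file name without extension')
    if "searsia" not in data:
        problems.append('missing "searsia" version')
    # Either an API template or the hits themselves must be present.
    if "apitemplate" not in resource and "hits" not in data:
        problems.append('need either "apitemplate" or "hits"')
    return problems

index_json = ('{"resource": {"id": "index", "name": "My Engine"}, '
              '"searsia": "v1.0.0", "hits": [{"rid": "engine1"}]}')
print(check_definition("index.json", index_json))  # no problems expected
print(check_definition("main.json", index_json))   # "id" does not match "main"
```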
We can test our engine definition index.json as follows:
java -jar searsiaserver.jar -m index.json -t json
which gives the following output:
Searsia server v1.0.2
Testing: index (My Engine)
{"hits": [
{"rid": "engine1"},
{"rid": "engine2"},
{"rid": "engine3"}
]}
Warning: less than 10 results for query: searsia; see "testquery" or "rerank".
Test succeeded.
So, our test succeeded. Instead of testing the JSON output of a single engine, we might also test all engines in the federated search engine as follows:
java -jar searsiaserver.jar -m index.json -t all
which results in 3 errors, because we have not yet added the engines engine1, engine2 and engine3:
Searsia server v1.0.2
Testing: index (My Engine)
Warning: less than 10 results for query: searsia; see "testquery" or "rerank".
Testing: engine1
Test failed: FileNotFoundException: engine1.json (No such file or directory)
Testing: engine2
Test failed: FileNotFoundException: engine2.json (No such file or directory)
Testing: engine3
Test failed: FileNotFoundException: engine3.json (No such file or directory)
ERROR: Test failed: 3 engines failed.
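The failures above come from "rid" values without a matching definition file. A small Python sketch can list those missing files up front. It assumes, as the error messages suggest, that each engine is looked up as `<rid>.json` next to index.json; the helper `missing_engines` is hypothetical.

```python
import json
import tempfile
from pathlib import Path

def missing_engines(index_path):
    """List the rids in index_path's hits that lack a sibling <rid>.json file."""
    index_path = Path(index_path)
    data = json.loads(index_path.read_text(encoding="utf-8"))
    rids = [hit["rid"] for hit in data.get("hits", []) if "rid" in hit]
    return [rid for rid in rids
            if not (index_path.parent / f"{rid}.json").exists()]

# Demo in a temporary directory so that no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    index = Path(tmp) / "index.json"
    index.write_text(json.dumps({
        "resource": {"id": "index", "name": "My Engine"},
        "searsia": "v1.0.0",
        "hits": [{"rid": "engine1"}, {"rid": "engine2"}, {"rid": "engine3"}],
    }), encoding="utf-8")
    # Only engine1 gets a (dummy) definition file.
    (Path(tmp) / "engine1.json").write_text("{}", encoding="utf-8")
    missing = missing_engines(index)
    print(missing)
```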
Let's add some more example engines below.
To add an external Searsia resource (an engine that implements the Searsia protocol, returning results of MIME type application/searsia+json), add the engine's API template to the resource definition.
The file engine1.json contains:
{ "resource": { "id": "engine1", "name": "Sheet Music", "favicon": "https://drsheetmusic.com/images/drsheetmusic-note.png", "apitemplate": "https://drsheetmusic.com/searsia/index.json?q={searchTerms?}&page={startPage?}" }, "searsia": "v1.0.0" }
The search engine definition may again be tested with:
java -jar searsiaserver.jar -m engine1.json -t json
Searsia can provide search results by scraping the HTML that search engines return for their end users. To add an HTML resource, use text/html in the field "mimetype".
For the field "apitemplate", take the resource's URL from your browser after querying the search engine, replacing the query in the URL by {searchTerms}. If the URL does not contain a query, the search engine probably uses a POST request. To find out what the POST request is, we recommend Live HTTP Headers for Firefox. Put the POST string, with the query replaced by {searchTerms}, in the field "post".
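To make the role of the placeholders concrete, here is a minimal Python sketch of filling a template. It assumes OpenSearch-style placeholders, where `{name}` is required and `{name?}` is optional, and it URL-encodes the values; the function `fill_template` and the example.org URL are hypothetical, and the encoding Searsia applies may differ.

```python
import re
from urllib.parse import quote_plus

def fill_template(template, **params):
    """Substitute {name} and {name?} placeholders with URL-encoded values.

    Optional placeholders ({name?}) without a value become empty strings;
    required placeholders without a value raise KeyError.
    """
    def replace(match):
        name, optional = match.group(1), match.group(2) == "?"
        if name in params:
            return quote_plus(str(params[name]))
        if optional:
            return ""
        raise KeyError(name)
    return re.sub(r"\{(\w+)(\??)\}", replace, template)

url = fill_template(
    "https://example.org/search?q={searchTerms}&page={startPage?}",
    searchTerms="federated search",
)
print(url)
```

The same substitution idea applies to a "post" string, although a JSON-encoded body would need JSON escaping rather than URL encoding.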
For a JSON POST request, additionally use "postencode": "application/json".
Searsia uses XPath 1.0 to extract the search results from the web page. XPath is a query language for selecting elements of semi-structured data. Suppose the search results are displayed as list elements on the page. These are then encoded as <li> ... </li> on the page, and they can be extracted from the page with the XPath query //li. To find the most likely XPath query, we recommend Search Result Finder for Firefox. Fill in the XPath query in the field "itempath". To tell Searsia how to extract the components of the search result, add "extractors" for the fields title, description, url, or any other field you like (the client also supports image).
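The interplay of "itempath" and "extractors" can be tried out on a small, hand-written page. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath, so the attribute extractor is emulated with `get()`. The page content is invented, and real HTML usually has to be cleaned up into well-formed XML first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page.
page = """
<html><body><ul>
  <li><a href="https://example.org/a">First result</a> <span>About A</span></li>
  <li><a href="https://example.org/b">Second result</a> <span>About B</span></li>
</ul></body></html>
"""

root = ET.fromstring(page)
hits = []
for item in root.findall(".//li"):               # plays the role of itempath "//li"
    anchor = item.find("./a")
    hits.append({
        "title": anchor.text,                    # like extractor "./a"
        "url": anchor.get("href"),               # like extractor "./a/@href"
        "description": item.findtext("./span"),  # like extractor "./span"
    })
print(hits)
```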
We recommend defining your HTML search engine incrementally, starting with an arbitrary "itempath" XPath query, for instance //foobar. Add a test query "testquery" for which you are sure the search engine gives at least 10 results.
A first attempt at engine2.json:
{ "resource": { "id": "engine2", "name": "Djoerd's Page", "apitemplate": "http://wwwhome.cs.utwente.nl/~hiemstra/?s={searchTerms}", "mimetype": "text/html", "testquery": "federated", "itempath": "//foobar" }, "searsia": "v1.0.0" }
Then test engine2.json with -t xml as follows. (This first attempt will not work, because the "itempath" is not yet correct, and the "extractors" that select the title, description and URL are still missing.)
java -jar searsiaserver.jar -m engine2.json -t xml
Tip: nicely format the XML with a tool like xmllint.
Because the API template and mimetype are correct, this test outputs the HTML search result page converted to XML. We can now see that the search results on this page use <h3> header tags. Each header tag contains the title of the result (XPath extractor: .) and, in the href attribute of its <a> anchor tag, the URL (XPath extractor: ./a/@href). The first text node that follows the header is the description (XPath extractor: ./following-sibling::text()[1]).
After editing, the file engine2.json contains:
{ "resource": { "id": "engine2", "name": "Djoerd's Page", "apitemplate": "http://wwwhome.cs.utwente.nl/~hiemstra/?s={searchTerms}", "mimetype": "text/html", "testquery": "federated", "itempath": "//h3", "extractors": { "title": ".", "url": "./a/@href", "description": "./following-sibling::text()[1]" } }, "searsia": "v1.0.0" }
Testing the engine with JSON output gives the following result:
java -jar searsiaserver.jar -m engine2.json -t json
Searsia server v1.0.2
Warning: Mother changed to http://wwwhome.cs.utwente.nl/~hiemstra/?s={searchTerms}
Testing: engine2 (Djoerd's Page)
{ "hits": [
{
"title": "Federated Search for Sheet Music",
"description": "Dr. Sheet Music is a federated search engine for sheet music.",
"url": "http://www.cs.utwente.nl/~hiemstra/2017/federated-search-for-sheet-music.html"
},
(...)
]}
Test succeeded.
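The three extractors of engine2.json can be approximated outside Searsia to see what each one selects. The Python sketch below runs on an invented page fragment with the same shape (an <h3> holding a link, followed by plain text). Because `xml.etree.ElementTree` lacks the following-sibling axis, the element's `tail` text stands in for `./following-sibling::text()[1]`; this is an approximation, not Searsia's own evaluation.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment mimicking the structure described above.
page = """
<div>
  <h3><a href="https://example.org/one.html">First title</a></h3>
  First description.
  <h3><a href="https://example.org/two.html">Second title</a></h3>
  Second description.
</div>
"""

root = ET.fromstring(page)
hits = []
for h3 in root.findall(".//h3"):                   # itempath "//h3"
    hits.append({
        "title": "".join(h3.itertext()).strip(),   # extractor "."
        "url": h3.find("./a").get("href"),         # extractor "./a/@href"
        "description": (h3.tail or "").strip(),    # text right after the header
    })
print(hits)
```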
To add an XML resource, fill in application/xml in the field "mimetype". Then proceed as above. The Firefox Search Result Finder cannot be used in this case.
To add a JSON resource, fill in application/json in the field "mimetype". Searsia also uses XPath queries to interpret JSON output, by internally converting JSON to XML, where each JSON attribute name is converted to an XML element; JSON lists are converted to repeated XML elements with the JSON list's name. As above, test the resource with the -t xml switch for debugging.
The following example selects only the hits that have a URL from a JSON search engine (actually, a search engine that produces Searsia's JSON). This definition selects part of the search results, because a Searsia engine might also show results without a URL, and it might show more fields than the ones selected here.
The file engine3.json contains:
{ "resource": { "id": "engine3", "name": "UT Search", "apitemplate": "https://search.utwente.nl/searsia/index?q={searchTerms}", "mimetype": "application/json", "testquery": "campus", "itempath": "//hits[./url]", "extractors": { "title": "./title", "url": "./url", "description": "./description" } }, "searsia": "v1.0.0" }
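The JSON-to-XML conversion described above can be sketched in a few lines of Python to show why a path such as //hits[./url] works on JSON. This is an illustration under assumptions: the function `json_to_xml` and the wrapper element name "root" are made up, and Searsia's internal conversion may differ in details. ElementTree spells the predicate as `[url]`.

```python
import json
import xml.etree.ElementTree as ET

def json_to_xml(name, value):
    """Convert a JSON value into a list of XML elements named `name`."""
    if isinstance(value, list):
        # A list becomes repeated elements carrying the list's name.
        elements = []
        for entry in value:
            elements.extend(json_to_xml(name, entry))
        return elements
    element = ET.Element(name)
    if isinstance(value, dict):
        # Each attribute name becomes a child element.
        for key, child in value.items():
            element.extend(json_to_xml(key, child))
    else:
        element.text = "" if value is None else str(value)
    return [element]

doc = json.loads("""
{"hits": [
  {"title": "With URL", "url": "https://example.org/1"},
  {"title": "Without URL"}
], "searsia": "v1.0.0"}
""")
root = json_to_xml("root", doc)[0]
print(ET.tostring(root, encoding="unicode"))

selected = root.findall(".//hits[url]")  # ElementTree form of "//hits[./url]"
titles = [hit.findtext("title") for hit in selected]
print(titles)
```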
Now we can test all engines as follows:
java -jar searsiaserver.jar -m index.json -t all
Searsia server v1.0.2
Testing: index (My Engine)
Warning: less than 10 results for query 'searsia'; see "testquery" or "rerank".
Testing: engine1 (Sheet Music)
Testing: engine2 (Djoerd's Page)
Testing: engine3 (UT Search)
Test succeeded.
Searsia supports many APIs by including API keys as secret parameters that will not be shared, as well as the possibility to add custom HTTP headers. Look at the University of Twente Searsia API (the API of UT Search) for examples of Searsia's resource configurations, including several examples that use HTML scrapers, and examples for accessing the APIs of Google, Twitter, Facebook, Flickr, Instagram, and more. If you believe that Searsia is unable to get search results from an existing resource that should be supported, please post your question under Searsia Server Issues. Please note that Searsia is not meant to scrape sites that do not want to be scraped, and therefore does not contain ways to circumvent, for instance, session cookies.
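The idea behind secret parameters can be sketched as a plain substitution step. The Python example below assumes that private parameters appear in the template and headers as `{name}` placeholders, and that a shared copy of the definition simply omits "privateparameters". Both are assumptions for illustration, as are the helper names and the example.org API.

```python
def apply_private(text, private):
    """Replace each {name} placeholder in text by its secret value."""
    for name, value in private.items():
        text = text.replace("{" + name + "}", value)
    return text

def public_view(resource):
    """Copy of a resource definition without its secret parameters."""
    return {k: v for k, v in resource.items() if k != "privateparameters"}

resource = {
    "id": "example",
    "apitemplate": "https://api.example.org/search?q={searchTerms}&key={apikey}",
    "headers": {"Authorization": "Bearer {token}"},
    "privateparameters": {"apikey": "SECRET1", "token": "SECRET2"},
}

secrets = resource["privateparameters"]
url = apply_private(resource["apitemplate"], secrets)
headers = {k: apply_private(v, secrets) for k, v in resource["headers"].items()}
shared = public_view(resource)
print(url)
print(headers)
print(sorted(shared))
```

Note that the `{searchTerms}` placeholder is left untouched by this step, since it is filled per query rather than from the secrets.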
While the Searsia resource configurations provide a way to get the search results for a great variety of existing search engines, Searsia also provides a flexible way to structure the search results from these engines. The search results, i.e., the objects in the "hits" list, may contain any attribute that seems appropriate, for instance an attribute "phone_number" for a telephone directory or an attribute "nr_of_citations" for a search engine that searches scientific papers. The following attributes are reserved:
- "title": The title of the search result, which can be clicked to go to the web page that was found. Usually, the title equals the title of the web page that was found. The title is the only attribute that is mandatory.
- "url": The link to the web page that was found.
- "description": A small summary describing the result. This might be a snippet from the web site containing the query, or some other summary.
- "image": The URL of a (thumbnail) image, to be displayed with the search result.

The Results demo mockup below shows 7 ways to present the search results from Wikipedia's search suggestions; that is, the mockup shows the same search results 7 times using different configurations.
The 7 results presentations are achieved as follows:
- "tags":"#suggestion", which tells the client to display the result as a query suggestion.
- "urltemplate" (wikipedia.net) does not match the results' domains (wikipedia.org). Therefore, the client displays the URLs for each aggregated result. The "urltemplate" is the URL that the user will use to search on the site, whereas the "apitemplate" will be used by the server.
- "tags":"#image", which tells the client to display the results as an image result. Note how the XPath functions concat() and substring-after() are used to create a custom image URL.
- "title" and the "urltemplate", effectively creating a search engine that spawns a search on Wikipedia.
- "tags":"#small", telling the client to display the results on a single line.
- "tags":"#small". The header Related searches: cannot be clicked because the resource does not configure the "urltemplate" for the end user.
Note that each configuration uses the same "apitemplate": each of the 7 results effectively uses the exact same search engine. Please do not use this example in an actual server configuration. The configuration of the mockup, if used in an actual Searsia Server, would send each query 7 times(!) to Wikipedia.
The table below contains a quick reference for all fields that are supported inside the "resource" field:
Parameter | Explanation |
---|---|
"apitemplate" * | A URL specified following the Searsia URL template syntax, to be used by the server. |
"deleted" | A boolean. Value: true (no quotes) if the resource is deleted (only used in files; the server will use HTTP status 410, Gone). |
"extractors" | Field names and XPath queries for selecting parts of search results such as the title, url, and description. For example, the title would typically be selected as the first anchor text in a search result, i.e., (.//a)[1]. The XPath queries are evaluated with the item selected by "itempath" as context node, and they typically start with . (the 'self' axis step). |
"favicon" | The URL of the icon image, to be displayed to the user. Icons should have equal width and height. Icons are preferably png files, not smaller than 48x48 pixels. |
"headers" | HTTP headers to be sent to the API Template, consisting of a field name (without ':') and the field value. Like API Template and Post String, the field value may include parameters. |
"id" * | A unique string identifying this resource. Should be locally unique (within the server). |
"itempath" | An XPath 1.0 query that selects the search results from an HTML or XML result. XPath is also used to select results from JSON results, assuming a standard conversion to XML, where JSON lists are converted to repeated XML elements. |
"maxqueriesperday" | The maximum number of queries per day that may be sent to the search engine. If the maximum per day is reached, the engine gives an HTTP status 503, Service Unavailable. Default: 1000. |
"mimetype" | The format returned by the API Template. Supported formats are: application/searsia+json, text/html, application/xml, application/json. If omitted, the mime type application/searsia+json is assumed. |
"name" | A short name for this resource, to be displayed to the user. |
"post" | Only set if the API template HTTP method is POST, empty if the HTTP method is GET. Like the template, the post string may include parameters, delimited by curly brackets. |
"postencode" | Will encode the post string using a certain mime-type. Possible values: application/x-www-form-urlencoded (default) or application/json. |
"prior" | A value between 0 and 1 indicating the prior score of a resource to be selected (prior to knowing the query). This score is added to the score that is computed for the result's match with the query. (A number, so no quotes) |
"privateparameters" | Secret parameter names and their values. Occurrences of the parameter names in the API Template, Post String or Headers will be replaced by the value. The server will not share the parameter names and values with other clients, so it is safe to use them for API keys and secrets. |
"rerank" | Specifies a ranking algorithm or filter that is used to rerank/refilter the search results. Currently, random (randomly reorder the returned results), best (reorder by best match of query with result title and description), and bestrandom (reorder by best match, supplemented with random results if fewer than 10 matching results are found) are supported. All other strings are treated as best. |
"signature" | Used to sign API requests. Takes a function and optionally a private parameter. Possible value: HmacSHA256({key}) (used for Amazon's affiliate program). |
"testquery" | A query for which a search gives a non-empty result. If not set, the system should give a non-empty result for the query searsia. |
"urltemplate" | A URL specified following the Opensearch URL template syntax, to be used by users. The mime type of this URL must be text/html or application/xhtml+xml. |
* mandatory fields
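The defaults documented in the table can be made explicit when reading a definition. The Python sketch below fills in the documented defaults for "mimetype", "maxqueriesperday" and "postencode", and rejects an unsupported mime type. The function `with_defaults` is hypothetical; it checks only "id" as mandatory, because a definition may supply "hits" instead of an "apitemplate", as index.json does.

```python
SUPPORTED_MIMETYPES = {
    "application/searsia+json",
    "text/html",
    "application/xml",
    "application/json",
}

# Defaults as stated in the quick reference table.
DEFAULTS = {
    "mimetype": "application/searsia+json",
    "maxqueriesperday": 1000,
    "postencode": "application/x-www-form-urlencoded",
}

def with_defaults(resource):
    """Return a copy of a resource definition with documented defaults filled in."""
    if "id" not in resource:
        raise ValueError('"id" is mandatory')
    merged = {**DEFAULTS, **resource}
    if merged["mimetype"] not in SUPPORTED_MIMETYPES:
        raise ValueError("unsupported mimetype: " + merged["mimetype"])
    return merged

filled = with_defaults({"id": "engine2", "mimetype": "text/html"})
print(filled)
```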
If your error is: SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

Your Java version does not have the right SSL certificate to connect to the site. Updating Java might solve this problem. If not, get the certificate in Firefox by clicking on the lock in the browser address bar. Choose "More Information", "View Certificate", "General Details", "Export" and save the certificate (for instance as MySite.crt). Then, on your machine, locate your Java certificate files; on Ubuntu 18.04, they can be found here: /usr/lib/jvm/default-java/lib/security/cacerts.
Then update your certificates as follows:
sudo keytool -import -noprompt -trustcacerts -alias MySite -file MySite.crt -keystore /usr/lib/jvm/default-java/lib/security/cacerts
(use the password: changeit)