Site search

The approach used to search static sites generally is either to use an external search engine (e.g. Google) or to use javascript to search a index file built at generation time. Pagegen will, if configured, build a search index. The index is JSON and may be searched by the client using javascript.

Note

Site search is generally something that should be customized to each site as such Pagegen provides the building blocks, but some customization must be done.

Index generation

The index is created by searching the generated files for relevant content and is affected by the following settings in site.conf.

The page title tag and meta description tag are indexed for search, in addition to the following HTML tags.

Terms are weighted according to the tag they are found in, e.g. a term in a heading is weighted higher than the same term found in a paragraph. Search results are sorted by weight so that pages containing the search term in a title tag will be ranked higher in the search results than a page that only has the search term in in a paragraph.

Stop words

Many terms are not interesting to index, for instance the, an, it etc, these terms are generally referred to as stop words. Add stop words to stopwords.txt to keep them out of the index.

Index structure

The search index is generated to /search-index.json. It consists of two main sections, terms and urls. The terms section lists each term and a reference to the urls section indicating which page the term is found in. The urls section lists the pages that have been indexed with their title and description.

{
  "terms": {
    "hooks": [
      2,
      1
    ],
    "forbidden": [
      2
    ],
  },
  "urls": {
    "1": [
      "/user-manual/index.html",
      "User manual",
      "Pagegen user manual covers site installation, content management, site generation, design and layout, hooks and web server configuration tips."
    ],
    "2": [
      "/user-manual/hooks.html",
      "Hooks",
      "Hooks are user customizable executables that may be run at specific points during the site generation process."
    ]
  }

Note

The urls order of reference for each term is based on the terms weighting. In the example above for the term hooks exists in two pages and the array ordering is relevant, the reference numbers refer to the pages defined in the urls section. Hooks is ranked more significant than the User manual page.

In the urls section the url array contains three items: url, title and page description.

Implementing site search

To implement search a search form allows the user to enter and submit a search query. The query must be parsed and compared to the index structure and the search result returned to the user.

A search implementation is bundled with Pagegen. It consists of a regular html form that on submit makes a get request. The get request is handled by a page where the terms are searched for in the index structure using pure Javascript.

Search form

<form action="/search.html" method="GET">
  <input type="text" id="search-query" name="q" />
  <input type="submit" id="search-submit" value="Search" />
</form>

Search page

The search page basically gets the search query (query string parameter q), then fetches search-index.json and looks to see if any matches exist, if they do it returns the search results.