Site search

Static sites are generally searched in one of two ways: with an external search engine (e.g. Google), or with JavaScript that searches an index file built at generation time. Pagegen will, if configured, build such a search index. The index is JSON and may be searched by the client using JavaScript.

Note

Site search generally needs to be customized for each site. Pagegen therefore provides the building blocks, but some customization must be done per site.

Index generation

The index is created by searching the generated files for relevant content and is affected by the following site configuration options.

  • include_search Set to enable indexing
  • search_xpaths One or more (comma separated) XPaths to HTML tags whose children will be searched for terms to index. Default is /html/body. The XPaths should be as specific as possible to the main nodes containing page-specific content. E.g. for HTML5, perhaps index only article tags using //article. Indexing the whole page may clutter the index with terms from e.g. menu or footer sections.

The page title tag and meta description tag are indexed for search, in addition to the following HTML tags.

  • p
  • li
  • h1
  • h2
  • h3
  • h4
  • h5
  • h6
  • td
  • th
  • strong
  • em
  • i
  • b
  • a
  • blockquote
  • div
  • span
  • pre
  • abbr
  • address
  • cite
  • code
  • del
  • dfn
  • ins
  • kbd
  • q
  • samp
  • small
  • sub
  • sup
  • var
  • dt
  • dd
  • legend
  • caption
  • article
  • aside
  • details
  • figcaption
  • section
  • summary

Terms are weighted according to the tag they are found in, e.g. a term in a heading is weighted higher than the same term found in a paragraph. Search results are sorted by weight, i.e. a page containing the term hook in its title will be ranked higher in the search results than a page that only has it in a paragraph.
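
The weighting idea can be illustrated with a minimal sketch. The weight values, the TAG_WEIGHTS table, the rankPages function and the rule of summing weights per page are all assumptions made for illustration; Pagegen's actual weights and scoring are not described here.

```javascript
// Hypothetical tag weights (not Pagegen's real values).
const TAG_WEIGHTS = { title: 10, h1: 8, h2: 6, p: 1 };

// occurrences: array of { term, tag, pageId } records, one per place
// a term was found while indexing (an assumed intermediate format).
function rankPages(occurrences, term) {
  const scores = new Map();
  for (const occ of occurrences) {
    if (occ.term !== term) continue;
    const weight = TAG_WEIGHTS[occ.tag] ?? 1; // unknown tags get weight 1
    scores.set(occ.pageId, (scores.get(occ.pageId) ?? 0) + weight);
  }
  // Sort page ids by total weight, highest first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([pageId]) => pageId);
}

const occurrences = [
  { term: "hook", tag: "p", pageId: 1 },     // found in a paragraph
  { term: "hook", tag: "title", pageId: 2 }, // found in the title
];
console.log(rankPages(occurrences, "hook")); // [ 2, 1 ]
```

Page 2 is listed first because its only occurrence is in the title, which carries the highest assumed weight.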

Stop words

Many terms are not interesting to index, for instance the, an, it etc. These terms are generally referred to as stop words. Add stop words to stopwords.txt to keep them out of the index.
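
As a rough illustration of stop word filtering, the sketch below drops stop words from a piece of text before its terms would be indexed. It assumes a stop word file with one word per line and case-insensitive matching; neither is confirmed by this manual, and parseStopWords and indexableTerms are hypothetical helper names.

```javascript
// Parse stop word file contents, assuming one word per line.
function parseStopWords(text) {
  return new Set(
    text
      .split(/\r?\n/)
      .map((word) => word.trim().toLowerCase())
      .filter((word) => word !== "")
  );
}

// Split text into lowercase terms and drop any that are stop words.
function indexableTerms(text, stopWords) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((term) => term !== "" && !stopWords.has(term));
}

const stopWords = parseStopWords("the\nan\nit\n");
console.log(indexableTerms("It is the hook", stopWords)); // [ 'is', 'hook' ]
```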

Index structure

The search index is a JSON file located at the web root, /search-index.json. It consists of two main sections, terms and urls. The terms section lists each term together with references into the urls section, indicating which pages the term is found in. The urls section lists the pages that have been indexed, with their title and description.

{
  "terms": {
    "hooks": [
      2,
      1
    ],
    "forbidden": [
      2
    ]
  },
  "urls": {
    "1": [
      "/user-manual/index.html",
      "User manual",
      "Pagegen user manual covers site installation, content management, site generation, design and layout, hooks and web server configuration tips."
    ],
    "2": [
      "/user-manual/hooks.html",
      "Hooks",
      "Hooks are user customizable executables that may be run at specific points during the site generation process."
    ]
  }
}

Note

The order of the urls references for each term is based on the term's weighting. In the example above, for the term hooks, page 2 (Hooks) is ranked as more significant than page 1 (User manual).

In the urls section the url array contains three items: url, title and page description.
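
A client-side lookup against this structure might look like the following sketch. The function names searchIndex and search are hypothetical, the lowercasing of the query is an assumption (the example index only shows lowercase terms), and only single-term, exact-match queries are handled. The sample object is a shortened copy of the example index.

```javascript
// Look up one term and return its pages in the ranked order stored in the index.
function searchIndex(index, term) {
  // Assumption: terms are stored in lowercase.
  const ids = index.terms[term.toLowerCase()] ?? [];
  return ids.map((id) => {
    // Each urls entry is an array: [url, title, description].
    const [url, title, description] = index.urls[String(id)];
    return { url, title, description };
  });
}

// In a browser, the index could be loaded from the web root with fetch().
async function search(term) {
  const response = await fetch("/search-index.json");
  const index = await response.json();
  return searchIndex(index, term);
}

// Shortened copy of the example index, for demonstration only.
const sample = {
  terms: { hooks: [2, 1], forbidden: [2] },
  urls: {
    "1": ["/user-manual/index.html", "User manual", "Pagegen user manual ..."],
    "2": ["/user-manual/hooks.html", "Hooks", "Hooks are user customizable ..."],
  },
};
console.log(searchIndex(sample, "hooks").map((r) => r.title)); // [ 'Hooks', 'User manual' ]
```

Because the index already stores page references in ranked order, the lookup does not need to sort anything on the client.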