Description: Pagegen includes site search functionality in the form of an index of terms found in the generated files. The search index may be used to implement search functionality.

Static sites are generally searched either with an external search engine (e.g. Google) or with JavaScript that searches an index file built at generation time. Pagegen will, if configured, build a search index. The index is JSON and may be searched by the client using JavaScript.

!!! Note
    Site search generally needs to be customized for each site. Pagegen provides the building blocks, but some customization must be done.

## Index generation

The index is created by searching the generated files for relevant content and is affected by the following settings in `site.conf`.

* `include_search=True` enables indexing.
* `search_xpaths=` takes one or more (comma separated) XPaths to HTML tags whose children will be searched for terms to index. Default is `/html/body`. The XPaths should be as specific as possible to the main nodes containing page specific content, e.g. for HTML5 perhaps index only article tags using `//article`. Indexing the whole page may clutter the index with terms from e.g. menu or footer sections.

The page title tag and meta description tag are indexed for search, in addition to the following HTML tags.

* p
* li
* h1
* h2
* h3
* h4
* h5
* h6
* td
* th
* strong
* em
* i
* b
* a
* blockquote
* div
* span
* pre
* abbr
* address
* cite
* code
* del
* dfn
* ins
* kbd
* q
* samp
* small
* sub
* sup
* var
* dt
* dd
* legend
* caption
* article
* aside
* details
* figcaption
* section
* summary

Terms are weighted according to the tag they are found in, e.g. a term in a heading is weighted higher than the same term found in a paragraph. Search results are sorted by weight, so a page containing the search term in its title tag will rank higher than a page that only has the search term in a paragraph.

## Stop words

Many terms are not interesting to index, for instance "the", "an" and "it". These terms are generally referred to as stop words. Add stop words to `stopwords.txt` to keep them out of the index.

## Index structure

The search index is generated to `/search-index.json`. It consists of two main sections, terms and urls. The terms section lists each term together with references into the urls section indicating which pages the term is found in. The urls section lists the pages that have been indexed, with their title and description.

    {
      "terms": {
        "hooks": [
          2,
          1
        ],
        "forbidden": [
          2
        ]
      },
      "urls": {
        "1": [
          "/user-manual/index.html",
          "User manual",
          "Pagegen user manual covers site installation, content management, site generation, design and layout, hooks and web server configuration tips."
        ],
        "2": [
          "/user-manual/hooks.html",
          "Hooks",
          "Hooks are user customizable executables that may be run at specific points during the site generation process."
        ]
      }
    }

!!! Note
    The order of the url references for each term is based on the term's weighting.

In the example above, the term **hooks** exists in two pages and the array ordering is significant. The reference numbers refer to the pages defined in the **urls** section, so the **Hooks** page is ranked as more significant than the **User manual** page.

In the urls section each url array contains three items: url, title and page description.

## Implementing site search

To implement search, a search form allows the user to enter and submit a search query. The query must be parsed and compared to the index structure, and the search result returned to the user.

A search implementation is bundled with Pagegen.
It consists of a regular HTML form that makes a GET request on submit. The GET request is handled by a page where the terms are looked up in the index structure using pure JavaScript.

### Search form

### Search page

The [search page](https://github.com/oliverfields/pagegen/blob/master/src/pagegen/skel/content/search.html) gets the search query (query string parameter `q`), fetches `search-index.json` and checks whether any matches exist. If they do, it returns the search results.
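
The lookup described above can be sketched roughly as follows. This is an illustrative sketch, not the code from the bundled search page: the names `searchIndex`, `runSearchFromQueryString` and `sampleIndex` are made up for this example, and it assumes that terms are stored in lower case in the index and that a multi-word query should match pages containing any of the words.

```javascript
// Illustrative sketch only; this is not the code in Pagegen's search.html.

// Sample index copied from the "Index structure" example.
const sampleIndex = {
  terms: { hooks: [2, 1], forbidden: [2] },
  urls: {
    "1": [
      "/user-manual/index.html",
      "User manual",
      "Pagegen user manual covers site installation, content management, site generation, design and layout, hooks and web server configuration tips."
    ],
    "2": [
      "/user-manual/hooks.html",
      "Hooks",
      "Hooks are user customizable executables that may be run at specific points during the site generation process."
    ]
  }
};

// Hypothetical helper: return the pages matching any word in the query.
function searchIndex(index, query) {
  // Assumption: index terms are stored in lower case.
  const words = query.toLowerCase().split(/\s+/).filter(Boolean);
  const seen = new Set();
  const results = [];
  for (const word of words) {
    // Own-property check avoids matching inherited keys such as "constructor".
    if (!Object.prototype.hasOwnProperty.call(index.terms, word)) continue;
    // References are already ordered by weight, so keep that order.
    for (const ref of index.terms[word]) {
      const entry = index.urls[String(ref)];
      if (seen.has(ref) || !entry) continue;
      seen.add(ref);
      const [url, title, description] = entry;
      results.push({ url, title, description });
    }
  }
  return results;
}

// Browser-only wiring (hypothetical): read `q`, fetch the index, run the lookup.
async function runSearchFromQueryString() {
  const query = new URLSearchParams(window.location.search).get("q") || "";
  const response = await fetch("/search-index.json");
  const index = await response.json();
  return searchIndex(index, query);
}

// Logs the titles in ranked order: Hooks, then User manual.
console.log(searchIndex(sampleIndex, "hooks").map((page) => page.title));
```

For multi-word queries this sketch simply lists the matches for the first word, then any new matches for later words. A real implementation might instead combine weights across words, which the index as described does not expose directly.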