Build Your Own Node.js Search Engine for Github Wikis
In this article, we're going to learn how to create a knowledge base wiki in Github and then use Node.js to search through the pages and content of the wiki. The goal is to make it easy for people to find the information in the wiki that relates to their query. In other words, we are building a search engine for the wiki so that the knowledge it contains is easier to find.
Table of Contents
- Background
- Why do we need a search engine for Github wikis?
- Building the Search Engine
- Pre-requisites
- Downloading the Github Wiki
- indexer.js: Indexing the Github wiki contents
- The modules and libraries we need
- Setting the wiki URL with command-line arguments
- Walking through each Markdown file
- Turning each file into search data
- Tagging the content
- Parsing Markdown into a tree
- Grouping the content into sections
- Convert the content into search data
- The Completed Indexer and Call Graphs
- search.js: Searching through the database
- Where To Go From Here
Background
What is a Wiki?
One of the best ways to create and organize a knowledge base is through a wiki. Whenever you work on a software development project, whether it's proprietary or free/open source, you will want to build a knowledge base to document the reasons why the project was undertaken, how it's supposed to work, and any other additional useful information.
The power of the wiki is that it can evolve organically. There is no hierarchy until you start organizing knowledge in that way. If you don't see some piece of useful information, create a new wiki page or edit an existing one and add that knowledge. Wikis are free-form and evolve organically, which makes them immensely flexible and more of a living document than traditional documentation.
The first wiki was WikiWikiWeb (the c2 wiki), and Wikipedia is one of the largest wikis in existence. Wikis have many other uses and have proven to be excellent places to organize knowledge.
Github Wiki As Knowledge Base
Github is one of the largest sites for hosting the source code of projects. It supplies an issue tracker and it also supplies a wiki that can function as a knowledge base.
Github's documentation explains how to create a wiki for your repository.
The Github wiki is based on a flat list of Markdown files in a Git repository. That means the wiki has built-in version control and has an easy-to-use formatting language.
What is a Search Engine?
A search engine is given a search query and then searches through a
list of documents, files, or any other data for anything that is
related to that query. For example, when you search for "cars" in
Google, you will see search results for car dealerships and pictures
of cars.
There are two parts to a search engine:
- The indexer creates an index of content to search through
- The searcher searches through the content based on a query
Why do we need a search engine for Github wikis?
Github doesn't really provide a search engine of its own for wikis (unless, of course, you like downloading the wiki Markdown files and using grep
to search through them!). This limits the usefulness of the wiki as a knowledge organization tool and limits its readership. If you can't find useful information quickly, then what's the point of documenting that information in the first place?
Our search engine for Markdown-based wikis like the Github wiki aims to make the wiki more accessible and useful and to turn it into a quick reference.
Building the Search Engine
We're going to take a top-down approach to building the search engine. We already know what a search engine consists of (an indexer and a searcher), and we know that we're going to be converting Markdown files into a database that can be searched through.
Pre-requisites
To begin building a search engine, we're going to need a few things:
- Node.js and NPM (Node Package Manager)
- Node packages (you can install these with npm, as shown after this list):
- commonmark, a parser for Markdown, used for parsing each Markdown file so that we can index the right text whether it's a list item or a heading or a paragraph or a link.
- stemmer, for finding the "stem" of a word (for example, the stem of "searching" is "search"), used for turning words into keywords for the search engine.
- walk, for walking through a directory tree and list of files, used for finding all Markdown files in the wiki directory.
- The Markdown files from the Github wiki.
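Assuming you are using npm to manage dependencies, installing these three packages might look like this:
npm install commonmark stemmer walk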
The finished code for this project is available here: https://gitlab.com/rudolfo/librarian/tree/master
Downloading the Github Wiki
You can download a copy of your Github wiki by cloning it like this on the command-line:
git clone --depth 1 ssh://git@github.com/${USER}/${REPO}.wiki.git wiki
As an example, you can download the learning-perl6
wiki that I've set up:
git clone --depth 1 ssh://git@github.com/rudolfolah/learning-perl6.wiki.git wiki
This will clone the wiki into a directory named wiki, where you will see a few Markdown files.
indexer.js: Indexing the Github wiki contents
Now that we have the Markdown files of the wiki, we can start writing
the indexer.
The basic algorithm we are implementing is this:
- For each file in the wiki directory, we'll know if it is a Markdown file or not by checking its extension.
- If it is a Markdown file then process the file:
- Process the title by breaking it into tags.
- Process the content by walking through each node and grouping it by page heading and sub-headings:
- If a node is a heading, store it as the current heading to break the content down by sections (this is needed so we can link to the heading that's nearest to the content that was found).
- If a node is a literal (such as a paragraph, link, or list item), we add it to the current section.
- Convert the grouped content into indexed data for the database.
- Generate an ID to uniquely identify the data based on the content.
- Tag the content with keywords extracted from the sub-heading and the content itself.
- Create the URL that links to the nearest heading for the content.
- Save the database.
The modules and libraries we need
Let's begin with all the modules we need to import.
We need to be able to work with files and directories, which are covered by the fs, path, and walk modules. The walk module will let us walk through a directory tree, making it convenient to index each Markdown file.
const fs = require('fs');
const path = require('path');
const walk = require('walk');
Next, we need to be able to parse a Markdown file into different parts like paragraph, heading, list item, and link. The CommonMark JavaScript library does this for us and lets us walk through each node in the tree that makes up a Markdown file.
const commonmark = require('commonmark');
The next module we need is stemmer, which finds the "stem" of a word (for example, the stem of "searching" is "search") so that we can turn words into consistent keywords for the search engine:
const stemmer = require('stemmer');
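As a quick sanity check, this is roughly what stemmer returns for a couple of words; the exact stems depend on the stemmer package's algorithm, so treat the outputs as illustrative:
console.log(stemmer('searching')); // "search"
console.log(stemmer('searches')); // "search"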
The last module we need to import is crypto; we want to generate a hash for each piece of content that we index from the wiki pages. This lets us uniquely identify each piece of content, which makes it possible to apply relevancy scoring methods later on.
const crypto = require('crypto');
Setting the wiki URL with command-line arguments
As mentioned in the algorithm steps, we are going to be creating a link to the wiki. We cannot extract the URL of the wiki from the Markdown content so it will need to be supplied through the command-line arguments.
const wikiUrlPrefix = process.argv[2];
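For example, if the indexer is run with a wiki URL as its only argument (the learning-perl6 wiki URL is used here purely as an illustration), process.argv looks roughly like this:
// node indexer.js https://github.com/rudolfolah/learning-perl6/wiki
// process.argv[0] is the path to the node executable
// process.argv[1] is the path to indexer.js
// process.argv[2] is 'https://github.com/rudolfolah/learning-perl6/wiki'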
Walking through each Markdown file
We're going to use the walk module to create a walker, which will walk through each Markdown file in the wiki directory, wikiDir. We will also create an empty hash table to store the indexed data; in other words, it is the database of searchable content that we're building up:
const wikiDir = 'wiki/';
const walker = walk.walk(wikiDir);
let index = Object.create(null);
For each file in the wikiDir, we check whether it ends with the file extension .md, which tells us whether or not it is a Markdown-formatted file (this is a common convention). If it is, we read in all of the content of the file using fs.readFileSync. Then we store the results of processing the contents with the processFile function.
walker.on('file', function(root, fileStats, next) {
const fileName = fileStats.name;
if (fileName.endsWith('.md')) {
const pathName = path.join(wikiDir, fileName);
const content = fs.readFileSync(pathName).toString();
index[fileName] = processFile(fileName, content);
}
next();
});
We call next() to move on to the next file in the wiki directory.
We also need to handle any errors that show up while walking through the files:
walker.on('errors', function(root, nodeStatsArray, next) {
next();
});
And when we have finished walking through all the files, we loop through the index that we have built, converting it into a flat list, and print out our database as JSON. We can save this output to a file and load it into the searcher later on.
walker.on('end', function() {
let result = [];
for (var fileName in index) {
for (var i = 0; i < index[fileName].length; i += 1) {
result.push(index[fileName][i]);
}
}
console.log(JSON.stringify(result));
});
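To build the database, you could run the indexer and redirect its JSON output to a file. The file name is up to you; the searcher described below assumes it is called whatever.json:
node indexer.js https://github.com/rudolfolah/learning-perl6/wiki > whatever.json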
Turning each file into search data
Continuing along with our top-down programming approach, we're going to implement the function that processes each file. The two arguments supplied are the name of the file and its contents. We're going to store a list of all the content we parse and tag for the database.
function processFile(fileName, content) {
let result = [];
Every piece of content must be associated with its nearest heading. It's possible that a Markdown wiki page will have no headings at all and only a title (its filename); in that case, the title acts as the first heading so that all of the text on the page is still associated with something.
We would also like to break down the title of the wiki page into keyword tags. Every piece of text we extract from this wiki page will start off with these same tags. For example, if the title of the page is "Hello World", then the tags are ["hello", "world"].
const title = fileName.replace('.md', '');
const tags = breakIntoTags(title);
Then we need to parse the file's content, which is a Markdown-formatted string at this point, into a tree of nodes. Each node will let us know what type of Markdown formatting it is, such as whether it's a link or paragraph. We encapsulate this into a function so that we can choose a different parser other than commonmark later on if we find that the implementation is too slow.
const tree = contentToMarkdownTree(content);
Now we process the tree of Markdown nodes. Again, because we are using a top-down approach, we know the algorithm we want to implement so we can encapsulate each step into a separate function. For each section (based on the headings found) of the content, we break down the section into a set of keyword tags.
const processedContent = processContent(title, tree);
for (var heading in processedContent) {
const headingTags = breakIntoTags(heading);
Then for each paragraph, link, or list item in the section, we want to convert the item to search data. The search data will require the title of the file, the section heading that the item belongs to, the keyword tags from the title, the keyword tags from the heading, and the item's text itself.
for (var i = 0; i < processedContent[heading].length; i += 1) {
const item = processedContent[heading][i];
const data = convertToSearchData(title, heading, tags, headingTags, item);
After converting to search data, we insert the data into our growing
search database and then return the result:
result.push(data);
}
}
return result;
}
Tagging the content
When we break down a title or a section's contents into keyword tags, we need to clean them up by removing any non-alphabetic characters. We also need to lower-case the characters so the tags are consistent. The stemmer will shorten words to their stem, for example "searching" becomes "search."
function breakIntoTags(text) {
let clean = text.replace(/[^a-zA-Z]/g, ' ');
clean = clean.toLowerCase();
clean = clean.split(' ');
let tagsHash = Object.create(null);
for (var i = 0; i < clean.length; i += 1) {
if (shouldIgnoreWord(clean[i])) {
continue;
}
const stemmed = stemmer(clean[i]);
tagsHash[stemmed] = true;
tagsHash[clean[i]] = true;
}
let tags = [];
for (var key in tagsHash) {
if (key.length > 0) {
tags.push(key);
}
}
return tags;
}
We also use the shouldIgnoreWord function to check whether a word should be skipped rather than treated as a valid tag. Stop words like "the", "on", and "an", as well as single characters like "a" or "e", are not valid tags because they occur frequently and are not relevant to the content.
function shouldIgnoreWord(text) {
const stopWords = ['the', 'on', 'for', 'up', 'an', "'", 'to'];
return text.length === 1 || stopWords.indexOf(text) !== -1;
}
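To make the tagging concrete, here is roughly what breakIntoTags produces for a short phrase; the exact stems depend on the stemmer package, so treat this as a sketch:
// breakIntoTags('Searching the Logs!') cleans the text to 'searching the logs ',
// splits it on spaces, skips the stop word 'the', and keeps both the stemmed and
// original forms of the remaining words (empty strings are filtered out at the end):
// => ['search', 'searching', 'log', 'logs']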
Parsing Markdown into a tree
To convert the file content into a tree of Markdown nodes, we just create a new parser using commonmark and then return the results of calling the parse function:
function contentToMarkdownTree(content) {
const reader = new commonmark.Parser();
return reader.parse(content);
}
Grouping the content into sections
Now let's dive into how we're processing the content based on the Markdown tree. We first need to create a tree walker and keep track of the current heading and text. We also need to keep track of the sections we have already collected.
function processContent(title, tree) {
const walker = tree.walker();
let event, node;
let currentHeading = null;
let currentText = null;
let sections = {null: []};
We walk through each node in the Markdown tree and check whether it's a heading or whether it represents a literal value (such as a paragraph, a link, or a list item). When the node is a heading, we use a helper function, getNodeChildrenText, to clean up the text of the heading and store it as the current heading. If it's a literal, we group it into the current section (after some cleanup: replacing newlines with spaces and lower-casing the text):
while ((event = walker.next())) {
node = event.node;
if (node.type === 'heading') {
currentHeading = getNodeChildrenText(node);
} else if (node.literal) {
const text = node.literal.replace(/\n/g, ' ').toLowerCase();
if (sections[currentHeading]) {
sections[currentHeading].push(text);
} else {
sections[currentHeading] = [text];
}
}
}
Finally, after all the sections have been grouped, we assign the title
of the wiki page as the heading for any content that didn't have a
heading and return the grouped sections:
sections[title] = sections[null];
delete sections[null];
return sections;
}
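As an illustration, suppose a hypothetical page titled "Hello World" contains an introductory paragraph, a "## Setup" heading, and a paragraph under that heading. The grouped sections come out roughly like this (note that the heading's own text also lands in its section, because the walker visits the heading's text node as a literal):
// For a hypothetical "Hello World" page containing:
//
//   An intro paragraph with no heading above it.
//
//   ## Setup
//   Install the dependencies.
//
// processContent('Hello World', tree) returns roughly:
// {
//   'Setup': ['setup', 'install the dependencies.'],
//   'Hello World': ['an intro paragraph with no heading above it.']
// }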
And here is the helper function, getNodeChildrenText, which builds the heading text by concatenating the text of the heading node's children:
function getNodeChildrenText(node) {
let text = '';
// Walk the heading's direct children and join their literal text
let child = node.firstChild;
while (child) {
text += child.literal || '';
child = child.next;
}
return text;
}
Convert the content into search data
At this point, to recap, we are looping through each Markdown file, extracting the title and the content of the wiki page. We have grouped each piece of content into sections under the headings that are in the wiki page. We have also built up a partial list of keyword tags based on the section heading.
Now it's time to convert each piece of content into search data.
The data we need are:
- A unique ID to identify the content
- The title of the page this content belongs to (example: "Wiki Page")
- The URL to the wiki page (example: "Wiki-Page.html")
- The heading of the section that the content belongs to (example: "Some Section")
- The URL to the wiki page, deep-linked to the section (example: "Wiki-Page.html#some-section")
- The content itself
- The keyword tags that are related to the content and the heading
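Putting that together, a single record in the database might look roughly like this. The values are illustrative, reusing the hypothetical "Hello World" page from earlier and the learning-perl6 wiki URL as the prefix:
// A single indexed record (illustrative values):
// {
//   id: 'c3f9a2...',  // SHA-256 hex digest, shortened here
//   title: 'Hello World',
//   title_url: 'https://github.com/rudolfolah/learning-perl6/wiki/Hello-World',
//   heading: 'Setup',
//   heading_url: 'https://github.com/rudolfolah/learning-perl6/wiki/Hello-World#setup',
//   content: 'install the dependencies.',
//   // tags combine title, content, and heading keywords (stemmed and unstemmed)
//   tags: ['hello', 'world', 'instal', 'install', 'depend', 'dependencies', 'setup']
// }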
function convertToSearchData(title, heading, tags, headingTags, item) {
// item is the text of a paragraph, link, or list item within the section
const id = generateId(title, heading, item);
const titleUrl = `${wikiUrlPrefix}/${title.replace(/ /g, '-')}`;
const headingUrlSuffix = heading.toLowerCase().replace(/[\/\(\),.]/g, '').replace(/ /g, '-');
return {
id: id,
title: title,
title_url: titleUrl,
heading: heading,
heading_url: title === heading ? titleUrl : `${titleUrl}#${headingUrlSuffix}`,
content: item,
tags: tags.concat(breakIntoTags(item)).concat(headingTags)
};
}
When we write the searcher that will execute search queries, we will have enough data to search through the tags and the content to see if it is relevant and then we can retrieve the URL of the page where that content exists.
This is the helper function that generates the unique ID. It relies on the crypto.createHash function and applies the SHA-256 algorithm to all of the arguments that have been supplied to it:
function generateId() {
const hash = crypto.createHash('sha256');
// Hash every argument (title, heading, content) into a single digest
for (var i = 0; i < arguments.length; i += 1) {
hash.update(String(arguments[i]));
}
return hash.digest('hex');
}
The Completed Indexer and Call Graphs
We used a top-down approach to the problem of indexing wiki pages because the algorithm is very clear and straight-forward. We don't have to worry about user input, we don't have to do anything fancy, we just need to go through it step-by-step and index the content that we find and tag it with the relevant keywords and tags.
Based on all the code above, here is a call graph that shows the dependencies between functions:
The annotations I've included in the diagram map to the top-down algorithm we implemented, and match the sections we just covered (compare the titles of each section to the annotations).
Here is another version of the call graph that includes imported modules like fs, crypto, and commonmark:
Call graphs are a great tool to help understand a code base whether it's your own code that you're reading or someone else's. They outline which functions are called by other functions. They can be used to conduct traces of values as well, like a debugger. What I like is that they provide an overview of the scope of a program.
search.js: Searching through the database
Now that we have generated a database index of all our wiki pages'
content, we can start searching through it. This is very easy to do
since we've already tagged the content with keywords. To keep things
simple, we will do a single keyword search for relevant content. This
basic search is also great for debugging the search index because you
can see which results are showing up or not and then you can tune
indexer.js
to add better keyword tags.
First we use a command-line argument to supply the search keyword:
const query = process.argv[2].replace(/-/g, '').toLowerCase();
Then we load up the database generated by indexer.js. I saved its output to a file called whatever.json:
const data = require('./whatever.json');
We're taking advantage of the fact that Node.js can load up JSON files
without explicitly creating a JSON parser.
Next we start to build up the search results by looping through all of
the data in the search database:
let result = {};
for (var i = 0; i < data.length; i += 1) {
const item = data[i];
If we find the search keyword in the list of tags for the content, we
include it in the search results:
if (item.tags.indexOf(query) !== -1) {
result[item.id] = item;
console.log('found in tags: ' + item.tags);
If we find the search keyword in the actual content, we include it in the search results as well:
} else if (item.content.indexOf(query) !== -1) {
result[item.id] = item;
console.log('found in content: ' + item.content);
}
}
Afterward, we just iterate through all of the search results and print
out the page title, the nearest heading, and the URL to the wiki page that the search result is found on, along with the content that was found in the database:
console.log('#########');
for (var id in result) {
const item = result[id];
console.log(`${item.title} - ${item.heading}`);
console.log(`Link: ${item.heading_url}`);
console.log();
console.log(item.content);
console.log('@@@@@@@@@');
}
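For example, a run of the searcher might look like this; the match and the text shown are illustrative and depend entirely on your wiki's content:
node search.js install
found in tags: hello,world,instal,install,depend,dependencies,setup
#########
Hello World - Setup
Link: https://github.com/rudolfolah/learning-perl6/wiki/Hello-World#setup

install the dependencies.
@@@@@@@@@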
Where To Go From Here
Now that you have a search engine, you can build more interfaces to access the database. One cool example is a Slack/HipChat chatbot: you can ask the bot a question, and it will turn the question into a query, search for the answer in the Github wiki, and return a link to it.
This is something that I worked on with a coworker, Suhail Dawood, at an internal company hackathon (which Universe.com holds monthly). We made it possible to search through the company's wiki, and the Slack chatbot used a classifier to make the search interface smarter. The chatbot saw immediate use and has been useful in answering questions like "what's the office wifi password?" and "where are the server logs?"
Another example is building wiki-bots. Wikipedia users employ wiki-bots to automate the cleanup of certain pages or to find pages that have broken links. Since Github wiki pages don't have a table of contents, you could create a wiki-bot that scans through a page for headings and creates a table of contents and inserts it into the page.
A search engine makes a knowledge base far more useful, and a wiki-bot makes the wiki more accessible.