HEATHCO SEARCH ENGINE

Manual (Version 2.2)

 

http://www.heathcosoft.com/webproducts/search

 

Contents:

 

1. Installing and setting up the search engine

            1.1 Creating a new search

            1.2 Indexing the search

            1.3 Indexing from the command line

            1.4 Preventing indexing of specific pages/portions of pages

            1.5 Adding/deleting individual URLs after indexing

2. Incorporating the search into your site

3. Customizing the search results page

            3.1 Editing the results style sheet

            3.2 Editing the results template

4. General search engine settings

            4.1 Indexer settings

            4.2 Relevancy settings

            4.3 Character settings

5. Cronjobs (automating the indexing process)

 


1. INSTALLING AND SETTING UP THE SEARCH ENGINE

 

 

Once you have downloaded the search engine package file, the only file you will need to edit is the settings.php file. In this file, you will see:

 

$SETTING["password"]              = "password";

$SETTING["db_login"]              = "database username";

$SETTING["db_password"]           = "database password";

$SETTING["db_name"]               = "database name";

$SETTING["db_prefix"]             = "table prefix";

$SETTING["regname"]               = "registration name";

$SETTING["regkey"]                = "registration key";

 

Change "password" to the password you want to use to log in to the search engine when configuring searches and so on (do not confuse this with the database password). Change "database username" and "database password" to the login and password used for connecting to MySQL. The "database name" is the name of the MySQL database to use, and the "table prefix" is the prefix given to every table that the search engine creates.

 

The registration name and key can be taken directly from the verification email you receive after you order the search engine.  This will allow you to download upgrades to the search engine directly from a link within it (the Upgrade link).

 

Once you have updated the settings file, upload all the files to the directory of your choice on your web server. You may then begin using the search engine by accessing index.php in the folder to which you uploaded the files.

 

1.1 CREATING A NEW SEARCH

 

The search engine allows you to create many different searches, each of which can be set up to index a different section of your web site. Click on Create New Search from the link area. All you need to do to create a new search is give it a unique name. Once your search has been created, you can begin to set it up by adding URLs for it to index.

 

URL

 

The URL is the location of the web page you want to begin indexing.  Since the search engine is a spider, it can follow links on the web page and recursively index other web pages.

 

Max URL(s)

 

This field specifies the maximum number of pages to index when spidering this URL.  If you just want to index the one page alone, specify 1.  Otherwise, the engine will stop fetching links and indexing them after reaching the maximum number specified.

 

Deny This Single URL

 

If checked, the URL you specify will NOT be indexed.  This denies only the single URL you enter, so entering http://www.example.com excludes that one URL and not its subpages or subdirectories.  To deny groups of URLs, use a regular expression as described below.

 

Deny URL(s) as a Regular Expression

 

Entering a regular expression URL gives you the most power.  If you do not know how to enter a regular expression, you can read up at Perl.com.  Here are some common regular expressions:

Deny an entire domain

/http:\/\/www\.example\.com\/.*/

Deny all txt files

/http:\/\/www\.example\.com\/.*\.txt/

Deny images subdirectory

/http:\/\/www\.example\.com\/images\/.*/

Only index articles directory and its subdirectories

/http:\/\/www\.example\.com\/(?!articles\/).*/
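Before adding a deny pattern, it can help to try it against a few sample URLs. The following is a minimal sketch in Python, not the engine's own code. It assumes that a URL is denied when the pattern matches anywhere in it; the function name is_denied, the dictionary DENY_PATTERNS, and the sample URLs are made up for illustration. The patterns are written without the surrounding / delimiters (and without escaped slashes) because Python does not use them.

```python
import re

# Hypothetical deny patterns in the spirit of the examples above.
DENY_PATTERNS = {
    "entire domain": r"http://www\.example\.com/.*",
    "txt files": r"http://www\.example\.com/.*\.txt",
    "images subdirectory": r"http://www\.example\.com/images/.*",
    # Denies everything on the domain that is NOT under /articles/.
    "outside articles": r"http://www\.example\.com/(?!articles/).*",
}


def is_denied(url, pattern):
    """Return True if the deny pattern matches somewhere in the URL.

    Assumption: the engine treats a match anywhere in the URL as a denial.
    """
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    for label, pattern in DENY_PATTERNS.items():
        for url in ("http://www.example.com/articles/a.html",
                    "http://www.example.com/images/logo.gif"):
            print(label, url, is_denied(url, pattern))
```

Note that the last pattern also denies the front page of the domain, so a search that uses it should start at a URL inside the articles directory.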

 

Restrict to Indexing Within Same Domain

 

If checked, the spider will not leave the domain specified in the URL.  For example, if you index http://www.example.com and one of its pages links to http://www.example2.com, the linked page will NOT be indexed.  Only pages under http://www.example.com will be indexed.
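To see how Max URL(s) and the same-domain restriction interact, consider the rough Python sketch below. It is only an illustration of a breadth-first spider under stated assumptions, not the engine's actual algorithm: the link graph LINKS is a made-up stand-in for real pages, and the names crawl_order and same_domain are hypothetical.

```python
from collections import deque
from urllib.parse import urlparse


def same_domain(url, start_url):
    """Compare the host part of two URLs, ignoring case."""
    return urlparse(url).netloc.lower() == urlparse(start_url).netloc.lower()


def crawl_order(start_url, links, max_urls, restrict_to_domain=True):
    """Return the URLs a breadth-first spider would index, in order.

    links maps each page to the pages it links to (a stand-in for fetching).
    The spider stops once max_urls pages have been indexed.
    """
    seen = {start_url}
    queue = deque([start_url])
    indexed = []
    while queue and len(indexed) < max_urls:
        url = queue.popleft()
        indexed.append(url)
        for target in links.get(url, []):
            if target in seen:
                continue
            if restrict_to_domain and not same_domain(target, start_url):
                continue  # do not leave the starting domain
            seen.add(target)
            queue.append(target)
    return indexed


# Hypothetical link graph for illustration only.
LINKS = {
    "http://www.example.com/": [
        "http://www.example.com/a.html",
        "http://www.example2.com/",
        "http://www.example.com/b.html",
    ],
    "http://www.example.com/a.html": ["http://www.example.com/c.html"],
}

if __name__ == "__main__":
    print(crawl_order("http://www.example.com/", LINKS, max_urls=3))
```

With max_urls set to 1, only the starting page is returned, which corresponds to indexing the one page alone.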

 

1.2 INDEXING THE SEARCH

 

Once you have added at least one URL to the Search URLs, you can index them.  Indexing spiders the links on each page, reads the linked pages in, and organizes them into the database.  This can take from a few minutes to a few hours, depending on what you specified for Max URL(s) and whether the files are on your local server.  There are two ways in which you can index:

 

Index Modified URLs

 

If you choose this, only URLs that have been modified since the last indexing will be re-indexed.  This is usually much faster than indexing all URLs, especially if the URLs consist mainly of static pages.  The first time you index a search, however, it will end up indexing all URLs anyhow.

 

Index All URLs

 

If you choose this, the index database will be cleared and all URLs will be completely re-indexed.  This is what happens the first time you create a search.  You will usually want to re-index only modified URLs after your first index.
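This manual does not say how the engine decides that a URL has been modified. One common approach, shown here purely as an assumption, is to compare the Last-Modified header sent by the web server with the time of the last index. The function name needs_reindex is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_reindex(last_modified_header, last_indexed):
    """Decide whether a page should be fetched and indexed again.

    last_modified_header: value of the HTTP Last-Modified header, or None.
    last_indexed: timezone-aware datetime of the previous index, or None.
    A page with no previous index or no header is always re-indexed.
    """
    if last_indexed is None or not last_modified_header:
        return True
    return parsedate_to_datetime(last_modified_header) > last_indexed


if __name__ == "__main__":
    header = "Wed, 21 Oct 2015 07:28:00 GMT"
    previous = datetime(2015, 10, 20, tzinfo=timezone.utc)
    print(needs_reindex(header, previous))
```

Dynamic pages often send no Last-Modified header at all, which is one reason static pages benefit most from indexing only modified URLs.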

 

1.3 INDEXING FROM THE COMMAND LINE

 

You can also index from the command line, which is especially useful for cronjobs and for indexing on a regular basis.  See section 5 (Cronjobs) for details.


1.4 PREVENTING INDEXING OF SPECIFIC PAGES/PORTIONS OF PAGES

 

There are a few ways to completely avoid indexing specific pages.  The first is the Deny This Single URL feature described above.  You can also use the meta noindex and nofollow tags as described here: http://www.robotstxt.org/wc/meta-user.html.

You may also use the robots.txt file exclusion as described here: http://www.robotstxt.org/wc/exclusion-admin.html.

If you want to index a page, but exclude a portion of the page from indexing, you can include the following tags around the area you want excluded:

<!-- noindex -->stuff you don’t want included here<!-- endnoindex -->
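The effect of these tags can be pictured as removing the marked region before the page text is stored. The Python sketch below illustrates that idea only; it is not the engine's code, and it assumes that spacing and letter case inside the markers do not matter. The name strip_noindex is hypothetical.

```python
import re

# Matches everything from a noindex marker through the next endnoindex marker.
NOINDEX_BLOCK = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*endnoindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex(html):
    """Return the page with every noindex region removed."""
    return NOINDEX_BLOCK.sub("", html)


if __name__ == "__main__":
    page = ("<p>Keep</p><!-- noindex --><p>Skip</p>"
            "<!-- endnoindex --><p>Also keep</p>")
    print(strip_noindex(page))
```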


1.5 ADDING/DELETING INDIVIDUAL URLS AFTER INDEXING

 

You might want to add individual URLs to the search that you left out or that the engine skipped over when indexing.  One way is to simply add the URL and then to re-index.  If you want to only add a few individual URLs, you can type each URL into the Add Individual URLs section.

On the other hand, you may want to review all of the URLs that have been indexed and selectively remove unwanted ones from the database without re-indexing.  You can view this list and delete URLs by clicking on the link that displays the number of URLs that have been indexed (this link is found among the links below the main menu).  When you delete a URL, it is automatically added to the deny list so that it is not scanned again on the next index.


2. INCORPORATING THE SEARCH INTO YOUR SITE

 

 

It is very simple to put the search engine into your web pages.  In the page where you include the text box to enter the search query, you can use <form> code similar to the following:

 

<form action="your_search_page.php" method="get">

<input type="hidden" name="name" value="example_search" />

<input type="hidden" name="results_per_page" value="10" />

<input type="text" name="query" />

<input type="submit" value="Search" />

</form>

 

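In this snippet of code, there are some things you should note:

The action attribute should point to the page that will display the search results.

The hidden name field holds the name of the search you created (example_search here).

The hidden results_per_page field sets how many results are shown on each page.

The query field is the text box in which the user types the search query.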

If you are using your own search page to submit the form to (in which case the file should be a PHP file), you will need to place the following snippet of code in order to conduct the search:

 

<?php include "/path/to/search/folder/search.php"; ?>

 

Just make sure that the path to search.php within your search folder is correct.  You can include any other HTML headers and footers above and below the code.  Be aware that some variables in your own PHP scripts might be overwritten by the included file (this usually won’t be a problem).
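Because the search form shown earlier in this section uses the GET method, a search can also be started from an ordinary link. The Python sketch below builds such a link from the three field names used in that form (name, results_per_page and query). The function name search_url is hypothetical, and your_search_page.php is the same placeholder page name used in the form.

```python
from urllib.parse import urlencode


def search_url(page, search_name, query, results_per_page=10):
    """Build the URL that the search form would request on submit."""
    params = {
        "name": search_name,
        "results_per_page": results_per_page,
        "query": query,
    }
    return page + "?" + urlencode(params)


if __name__ == "__main__":
    print(search_url("your_search_page.php", "example_search", "red widgets"))
```

Spaces and other special characters in the query are encoded, so the link is safe to place in an href attribute once the ampersands are written as &amp;.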

 


3. CUSTOMIZING THE SEARCH RESULTS PAGE

 

 

You can customize almost every aspect of the search results page that will be displayed after users execute a search query. To get to the customization page, select Customize Look of Search Results Page from the edit search page.

 

3.1 EDITING THE RESULTS STYLE SHEET

 

Editing the results style sheet is the easiest way to customize the appearance of various parts of the search results page. If you need help with CSS style sheet syntax, you can visit http://www.htmlhelp.com for HTMLHelp’s Style Sheet Guide.

 

search_header is the style of the header information displaying the current results at the top of the results page.

 

search_result_title is the style of the clickable link that will contain the title of each result.

 

search_result_content is the style of the content-portion of the web page that contains the information relevant to the search.

 

search_result_url is the style of the URL that will be displayed at the end of each result.

 

search_result_description is the style of the meta-description if available.

 

search_result_highlight is the style that will be applied to any of the search keywords that exist in the result title, content, or description.

 

search_footer is the style of the links that go to the other results pages (i.e. previous page, next page, 5th page...).

 

search_time is the style of the displayed search execution time.

 

search_copyright is the style of the search engine copyright displayed at the bottom of the results page. It can be removed by editing the results template below.

 

3.2 EDITING THE RESULTS TEMPLATE (advanced users)

 

If you want even more control over the appearance of the search results page, you can edit the results template.  Please note that this is much more difficult than editing the styles above. If any of your changes happen to cause problems that you are unable to fix, you can revert to the original style sheet and template by clicking on Restore Original Style Sheet and Template.

 

The default template is shown below.  The lines in capital letters are explanatory comments and are not part of the template:

 

IF THERE ARE RESULTS FOUND FOR THIS SEARCH...

<!-- IF [results] -->

 

DISPLAY THE HEADER INFORMATION [i.e. 'Displaying 1-10 of 22 result(s)']

<div class="search_header">

Displaying {FIRSTRESULT}-{LASTRESULT} of {TOTALRESULTS} result(s)</div>

<br />

 

START DISPLAYING EACH RESULT...

<!-- START results -->

 

DISPLAY THE TITLE OF THE RESULT

<div class="search_result_title">

<a href="{results.URL}">{results.TITLE}</a>

</div>

 

DISPLAY THE META-DESCRIPTION IF AVAILABLE

<!-- IF [results.DESCRIPTION] -->

<div class="search_result_description">

{results.DESCRIPTION}

</div>

<!-- ENDIF -->

 

DISPLAY THE CONTENT PORTION OF THIS RESULT MATCHING THE SEARCH

<div class="search_result_content">

{results.CONTENT}

</div>

 

DISPLAY THE URL OF THIS RESULT

<div class="search_result_url">

{results.URL}

</div>

 

<br />

 

END THE DISPLAYING OF EACH RESULT

<!-- END results -->

 

IF THERE IS MORE THAN ONE PAGE OF RESULTS...

<!-- IF [pages] -->

 

<div class="search_footer">

Results Page:

 

DISPLAY LINK TO PREVIOUS PAGE OF RESULTS IF AVAILABLE

<!-- IF isset( [PREVIOUSPAGE] ) -->

<a href="{page->url}?id={ID}&query={URLQUERY}&offset={PREVIOUSPAGE}&results_per_page={RESULTSPERPAGE}">&lt;&lt; Previous</a>

<!-- ENDIF -->

 

START DISPLAYING LINKS TO EACH PAGE OF RESULTS...

<!-- START pages -->

 

IF THIS PAGE IS THE CURRENT PAGE DISPLAYED, DON'T DISPLAY AS LINK

<!-- IF [pages.PAGE] == [CURRENTPAGE] -->

{pages.PAGE}

OTHERWISE, DISPLAY AS LINK

<!-- ELSE -->

<a href="{page->url}?id={ID}&query={URLQUERY}&offset={pages.OFFSET}&results_per_page={RESULTSPERPAGE}">{pages.PAGE}</a>

<!-- ENDIF -->

 

END THE DISPLAYING OF LINKS TO EACH PAGE OF RESULTS

<!-- END pages -->

 

DISPLAY LINK TO NEXT PAGE OF RESULTS IF AVAILABLE

<!-- IF isset( [NEXTPAGE] ) -->

<a href="{page->url}?id={ID}&query={URLQUERY}&offset={NEXTPAGE}&results_per_page={RESULTSPERPAGE}">Next &gt;&gt;</a>

<!-- ENDIF -->

 

</div>

 

ENDIF FOR MORE THAN ONE PAGE OF RESULTS AVAILABLE

<!-- ENDIF -->

 

<br />

 

DISPLAY THE SEARCH EXECUTION TIME

<div class="search_time">

Search took {EXECUTIONTIME} seconds

</div>

 

DISPLAY THE FOLLOWING IF NO RESULTS ARE FOUND

<!-- ELSE -->

No results were found

<!-- ENDIF -->

 

DISPLAY THE COPYRIGHT (can be removed if you wish)

<div class="search_copyright">Powered by <a href="http://www.heathcosoft.com/webproducts/searchsite/">Heathco Search Engine</a></div>

 

The <!-- (…) --> markers are not ordinary HTML comments, even though they look like them.  They are directives that the template engine reads, telling it to test conditions, loop over results, output variables, and so on.
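The variable part of the template can be pictured as a simple search and replace on the {NAME} placeholders. The Python sketch below illustrates only that substitution step. It is an assumption about how such an engine might work, not Heathco's implementation, and it ignores the IF and START directives entirely. The name fill_placeholders is hypothetical.

```python
import re

# A placeholder is a name in braces, such as {TOTALRESULTS} or {results.URL}.
PLACEHOLDER = re.compile(r"\{([A-Za-z_.]+)\}")


def fill_placeholders(template, values):
    """Replace each known {NAME} with its value; leave unknown ones as-is."""
    def replace(match):
        key = match.group(1)
        return str(values[key]) if key in values else match.group(0)
    return PLACEHOLDER.sub(replace, template)


if __name__ == "__main__":
    header = "Displaying {FIRSTRESULT}-{LASTRESULT} of {TOTALRESULTS} result(s)"
    print(fill_placeholders(header, {"FIRSTRESULT": 1,
                                     "LASTRESULT": 10,
                                     "TOTALRESULTS": 22}))
```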

 

 


4. GENERAL SEARCH ENGINE SETTINGS

 

 

You can access the general settings through the Settings link on the menu.  These settings let you configure everything from various aspects of the indexer to how relevancy is determined for results.

 

4.1 INDEXER SETTINGS

 

URLs Per Index

This is the number of URLs to index at a time when you index through the web interface (not through the command line).  A higher value might speed up indexing, but if it is set too high, the indexer might time out.  A good value is between 20 and 30.

 

Timeout

The amount of time the indexer will spend trying to connect to a URL.  A good value is between 5 and 30 seconds.

 

Max Read Size (Bytes)

The number of bytes to download from each individual URL when indexing.  This might not be the same as the number of bytes stored in the database, but it determines how much of a page will be parsed for links.

 

Max Store Size (Bytes)

The number of bytes from each URL to store in the database.  This is the searchable data (a larger value will slow searches down, but possibly improve result accuracy).

 

Stop Words

These are words that are ignored in searches.  They are common words that usually do not improve search results and only degrade performance.

 

Allowed Content Types

These are the content types that will be allowed for indexing.  text/html and text/plain are the two standard types that almost all web pages consist of.  Using these content types allows more control over what gets indexed, preventing the engine from downloading images, etc.
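The indexer settings above can be thought of as a series of small filters. The Python sketch below shows one possible reading of three of them (Allowed Content Types, Max Read Size and Max Store Size, and Stop Words). All names and sample values are hypothetical, and the real indexer may apply these settings differently.

```python
# Hypothetical sample values; the real lists are set on the Settings page.
ALLOWED_TYPES = {"text/html", "text/plain"}
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def is_allowed(content_type_header):
    """Check a Content-Type header such as 'text/html; charset=utf-8'."""
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in ALLOWED_TYPES


def read_and_store(body, max_read, max_store):
    """Return (downloaded, stored) portions of a page body.

    The downloaded part is what gets parsed for links; the stored part
    is what ends up in the database and can be searched.
    """
    downloaded = body[:max_read]
    return downloaded, downloaded[:max_store]


def searchable_terms(query):
    """Split a query into lowercase words and drop the stop words."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


if __name__ == "__main__":
    print(is_allowed("text/html; charset=utf-8"))
    print(read_and_store(b"abcdefghij", max_read=8, max_store=4))
    print(searchable_terms("The history of the web"))
```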

 

 

4.2 RELEVANCY SETTINGS

 

When results are returned, they are returned in order of relevancy (how relevant a page is to your search).  Relevancy is determined in this search engine by the page title, meta description, meta keywords, and page content.  By increasing or decreasing the relevancy factors, you can easily change the order in which results are returned.  The default settings work best for most sites, but you might want to change these factors if, for example, your page titles are all the same.
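One way to picture the relevancy factors is as weights in a weighted sum: each occurrence of the search term in a field adds that field's factor to the page's score. The Python sketch below is a simplified illustration under that assumption. The weights, field names, and the functions relevancy and rank are hypothetical, and the engine's real formula is not documented here.

```python
# Hypothetical relevancy factors; a larger number means the field counts more.
WEIGHTS = {"title": 5, "description": 3, "keywords": 2, "content": 1}


def relevancy(page, term):
    """Score a page by counting the term in each field, times its weight."""
    term = term.lower()
    return sum(
        weight * page.get(field, "").lower().count(term)
        for field, weight in WEIGHTS.items()
    )


def rank(pages, term):
    """Return the pages sorted from most to least relevant."""
    return sorted(pages, key=lambda page: relevancy(page, term), reverse=True)


if __name__ == "__main__":
    pages = [
        {"url": "b", "title": "Home", "content": "widget widget widget"},
        {"url": "a", "title": "Widgets", "content": "about widgets"},
    ]
    print([page["url"] for page in rank(pages, "widget")])
```

If every page on a site had the same title, the title weight would add the same amount to every score, which is why lowering it can help in that case.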

 

4.3 CHARACTER SETTINGS

 

Each result has a title, description, and content information.  These are trimmed down from the original page so that multiple results can be shown at a time.  By adjusting these character settings (a character is a letter/number/etc.), you can change how much information is displayed for the title and so forth.
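Trimming to a character limit can be sketched as follows in Python. This is an illustration only. Whether the engine appends an ellipsis or trims at word boundaries is not stated in this manual, so the "..." suffix and the function name trim are assumptions.

```python
def trim(text, max_chars):
    """Cut text to at most max_chars characters, marking the cut with '...'."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rstrip() + "..."


if __name__ == "__main__":
    print(trim("Heathco Search Engine Manual", 14))
    print(trim("Short", 14))
```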

 

 


5. CRONJOBS (AUTOMATING THE INDEXING PROCESS)

 

 

You can run the indexer from the command line (and so have a cronjob do the indexing) by using the search_index_cmd.php file.  Note that you must be able to execute PHP from the command line to do this.  Please read the online PHP documentation for help; Heathco Software does not support setting this up.

 

To run the indexer from the command line, use the following code.

 

php search_index_cmd.php -id [search id#] -name [search name] -all

 

The search id# is the ID number of the search you want to index.  The search name is the name of the search you want to index.  You can use one or the other, depending on your preference (there is no need to use both).  If you omit both -id and -name, then ALL searches will be indexed.

 

Specifying -all tells the indexer to re-index ALL URLs.  If you leave this out (which you usually should), it will only index URLs that have been modified since the last index.

 

By using this command line, you can set up a cronjob to do the indexing task on a daily or weekly basis (or at whatever interval you wish).
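If you start the indexer from a script rather than typing the command by hand, the options described above can be assembled as in the Python sketch below. Only the -id, -name and -all options come from this manual. The function name index_command is hypothetical, and the sketch assumes that php is on the PATH and that search_index_cmd.php is in the current directory.

```python
def index_command(search_id=None, name=None, reindex_all=False):
    """Build the argument list for running the command-line indexer.

    Pass either search_id or name (not both); pass neither to index
    every search. Set reindex_all to add the -all option.
    """
    if search_id is not None and name is not None:
        raise ValueError("use either search_id or name, not both")
    args = ["php", "search_index_cmd.php"]
    if search_id is not None:
        args += ["-id", str(search_id)]
    elif name is not None:
        args += ["-name", name]
    if reindex_all:
        args.append("-all")
    return args


if __name__ == "__main__":
    print(" ".join(index_command(name="example_search")))
```

The resulting list can be passed to subprocess.run from Python's standard library, or joined with spaces and used as the command in a crontab entry.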