Manual (Version 2.2)
http://www.heathcosoft.com/webproducts/search
Contents:
1. Installing and setting up the search engine
1.3 Indexing from the command line
1.4 Preventing indexing of specific pages/portions of pages
1.5 Adding/deleting individual URLs without re-indexing entire site
2. Incorporating the search into your site
3. Customizing the search results page
3.1 Editing the results style sheet
3.2 Editing the results template
4. General search engine settings
4.1 Indexer settings
5. Cronjobs (automating the index process)
1. Installing and setting up the search engine
Once you have downloaded the search engine package file, the only file you will need to edit is the settings.php file. In this file, you will see:
$SETTING["password"] = "password";
$SETTING["db_login"] = "database username";
$SETTING["db_password"] = "database password";
$SETTING["db_name"] = "database name";
$SETTING["db_prefix"] = "table prefix";
$SETTING["regname"] = "registration name";
$SETTING["regkey"] = "registration key";
The "password" should be changed to the password you want to use to log in to the search engine when configuring searches (do not confuse this with the database password, explained next). The "database username" and "database password" should be changed to the login and password used for connecting to MySQL. The "database name" is the name of the database to use in MySQL, and the "table prefix" is the prefix to use for any tables that the search engine creates.
The registration name and key can be taken directly from the verification email you receive after you order the search engine. This will allow you to download upgrades to the search engine directly from a link within it (the Upgrade link).
Once you have updated the settings file, upload all the files to the directory of your choice on your web server. You may then begin using the search engine by accessing index.php in the folder to which you uploaded the files.
The search engine allows you to create many different searches that can each be set up to index different sections of your web site. Click on Create New Search from the link area. All you need to create a new search is to give it a unique name. Once your search has been created, you can begin to set it up by adding URLs for it to index.
The URL is the location of the web page you want to begin indexing. Since the search engine is a spider, it can follow links on the web page and recursively index other web pages.
Max URL(s)
This field specifies the maximum number of pages to index when spidering this URL. If you want to index just the one page, specify 1. Otherwise, the engine stops fetching and indexing links once it reaches the specified maximum.
Deny This URL: if checked, the URL you specify will NOT be indexed. This only excludes the single URL you specify, so entering http://www.example.com excludes that one URL and not its subpages or subdirectories. For this kind of broader exclusion, and other more powerful denials, use a regular expression URL.
Entering a regular expression URL gives you the most power. If you do not know how to write a regular expression, you can read up at Perl.com. Here are some common regular expressions:
Purpose | Regular expression
Deny an entire domain | /http:\/\/www\.example\.com\/.*/
Deny all txt files | /http:\/\/www\.example\.com\/.*\.txt/
Deny images subdirectory | /http:\/\/www\.example\.com\/images\/.*/
Only index articles directory and its subdirectories | /http:\/\/www\.example\.com\/(?!articles\/).*/
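Before entering a deny pattern, it can help to try it against a few sample URLs. The sketch below uses Python's re module purely for illustration. The engine itself is written in PHP, and it is an assumption that it applies patterns as an unanchored search. The surrounding / delimiters are dropped because Python does not use them, and the sample URLs are made up. Note that in the "everything except articles" pattern the trailing slash sits inside the lookahead group.

```python
import re

# Deny patterns in the style described above, without the "/" delimiters.
DENY_PATTERNS = {
    "entire domain": r"http:\/\/www\.example\.com\/.*",
    "txt files": r"http:\/\/www\.example\.com\/.*\.txt",
    "images subdirectory": r"http:\/\/www\.example\.com\/images\/.*",
    "everything except articles": r"http:\/\/www\.example\.com\/(?!articles\/).*",
}


def is_denied(pattern: str, url: str) -> bool:
    """Return True if the deny pattern matches anywhere in the URL.

    Assumption: the engine performs an unanchored search with the pattern.
    """
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    samples = [
        "http://www.example.com/articles/a.html",
        "http://www.example.com/images/logo.gif",
        "http://www.example.com/docs/readme.txt",
    ]
    for label, pattern in DENY_PATTERNS.items():
        for url in samples:
            print(f"{label:28} {url:45} denied={is_denied(pattern, url)}")
```

With the last pattern, a URL under /articles/ is not denied, while any other URL on the same domain is.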
If the option to stay within the domain is checked, the spider will not leave the domain specified in the URL. For example, if you index http://www.example.com and a page contains a link to http://www.example2.com, that link will NOT be indexed. Only pages under http://www.example.com will be indexed.
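How Max URL(s) and the stay-in-domain option interact can be pictured with a small breadth-first crawl over an in-memory link graph. This is only a sketch and not the engine's actual algorithm: the function name crawl, the LINKS dictionary, and the breadth-first order are all assumptions made for illustration, and no network access is performed.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real pages: URL -> links found on it.
LINKS = {
    "http://www.example.com/": [
        "http://www.example.com/a.html",
        "http://www.example2.com/",
        "http://www.example.com/b.html",
    ],
    "http://www.example.com/a.html": ["http://www.example.com/c.html"],
    "http://www.example.com/b.html": [],
    "http://www.example.com/c.html": [],
    "http://www.example2.com/": [],
}


def crawl(start_url, max_urls, stay_in_domain=True):
    """Return the URLs that would be indexed, in breadth-first order."""
    start_host = urlparse(start_url).netloc
    indexed = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(indexed) < max_urls:
        url = queue.popleft()
        indexed.append(url)
        for link in LINKS.get(url, []):
            if link in seen:
                continue
            if stay_in_domain and urlparse(link).netloc != start_host:
                continue  # link leaves the starting domain, so skip it
            seen.add(link)
            queue.append(link)
    return indexed


if __name__ == "__main__":
    print(crawl("http://www.example.com/", max_urls=1))
    print(crawl("http://www.example.com/", max_urls=10))
```

With max_urls set to 1 only the starting page is indexed, and with the domain restriction on, the link to www.example2.com is never followed.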
Once you have added at least one URL to the Search URLs, you can index them. Indexing spiders the links, reads the pages in, and organizes them into the database. This can take from a few minutes to a few hours, depending on what you specified for Max URL(s) and whether the files are on your local server. There are two ways in which you can index:
The first way is to re-index only modified URLs. If you choose this, only URLs that have been modified since the last indexing will be re-indexed. This is usually much faster than indexing all URLs, especially if the URLs consist mainly of static pages. The first time you index a search, however, all URLs will be indexed anyway.
The second way is to re-index all URLs. If you choose this, the index database will be cleared and all URLs will be completely re-indexed. This is what happens the first time you index a search. You will usually want to re-index only modified URLs after your first index.
1.3 Indexing from the command line
You can also index from the command line, which is especially useful for cronjobs and for indexing on a regular basis. See section 5, Cronjobs, for more information.
1.4 Preventing indexing of specific pages/portions of pages
There are a few ways to completely avoid indexing specific pages. The first is to use the Deny This URL feature described above. You can also use the meta noindex and nofollow tags as described here: http://www.robotstxt.org/wc/meta-user.html.
You may also use the robots.txt file exclusion as described here: http://www.robotstxt.org/wc/exclusion-admin.html.
If you want to index a page but exclude a portion of it from indexing, you can place the following tags around the area you want excluded:
<!-- noindex -->stuff you don’t want included here<!-- endnoindex -->
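The effect of these markers can be pictured as removing everything between them before the page text is stored. The following Python sketch illustrates that idea only. The function name strip_noindex and the exact matching rules (letter case, whitespace inside the markers, nesting) are assumptions, not the engine's documented behavior.

```python
import re

# Matches one noindex region, including both markers, across line breaks.
NOINDEX_RE = re.compile(r"<!-- noindex -->.*?<!-- endnoindex -->", re.DOTALL)


def strip_noindex(html: str) -> str:
    """Remove every <!-- noindex --> ... <!-- endnoindex --> region."""
    return NOINDEX_RE.sub("", html)


page = (
    "<p>Indexed text.</p>"
    "<!-- noindex --><div>Navigation menu</div><!-- endnoindex -->"
    "<p>More indexed text.</p>"
)
print(strip_noindex(page))
```

The non-greedy match keeps two separate excluded regions on one page from swallowing the indexed text between them.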
1.5 Adding/deleting individual URLs without re-indexing entire site
You might want to add individual URLs to the search that you left out or that the engine skipped over when indexing. One way is to simply add the URL and then re-index. If you only want to add a few individual URLs, you can instead type each URL into the Add Individual URLs section.
On the other hand, you may want to review all of the URLs that have been indexed and selectively remove unwanted ones from the database without re-indexing. You can view this list and delete URLs by clicking on the link that displays the number of URLs that have been indexed (this link is found among the links below the main menu). When you delete a URL, it is automatically added to the deny list to prevent it from being scanned again on the next index.
2. Incorporating the search into your site
It is very simple to put the search engine into your web pages. In the page where you include the text box to enter the search query, you can use <form> code similar to the following:
<form action="your_search_page.php" method="get">
<input type="hidden" name="name" value="example_search" />
<input type="hidden" name="results_per_page" value="10" />
<input type="text" name="query" />
<input type="submit" value="Search" />
</form>
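In this snippet of code, there are some things you should note: the action attribute is the page that will display the results, the hidden name field holds the name of the search to use (example_search here), the hidden results_per_page field sets how many results appear on each page, and the query text field holds the text the user enters.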
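Because the form uses method get, the browser sends the field values as a query string appended to the URL of the page named in action. The short Python sketch below only shows what that URL looks like for the example form. The query text "heathco search" is made up.

```python
from urllib.parse import urlencode

# Field names and values taken from the example form; the query is made up.
fields = {
    "name": "example_search",
    "results_per_page": "10",
    "query": "heathco search",
}

url = "your_search_page.php?" + urlencode(fields)
print(url)
# your_search_page.php?name=example_search&results_per_page=10&query=heathco+search
```

This also means a search can be linked to directly by writing such a URL by hand.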
If you are using your own search page to submit the form to (in which case the file should be a PHP file), you will need to place the following snippet of code in it in order to conduct the search:
<?php include "/path/to/search/folder/search.php"; ?>
Just make sure that the path to search.php within your search folder is correct. You can include any other HTML headers and footers above and below the code, but some of the variables in your own PHP scripts might be overwritten (this usually won't be a problem).
3. Customizing the search results page
You can customize almost every aspect of the search results page that will be displayed after users execute a search query. To get to the customization page, select Customize Look of Search Results Page from the edit search page.
3.1 Editing the results style sheet
Editing the results style sheet is the easiest way to customize the appearance of various parts of the search results page. If you need help with CSS style sheet syntax, you can visit http://www.htmlhelp.com for HTMLHelp's Style Sheet Guide.
search_header is the style of the header information displaying the current results at the top of the results page.
search_result_title is the style of the clickable link that contains the title of each result.
search_result_content is the style of the content portion of the web page that contains the information relevant to the search.
search_result_url is the style of the URL that is displayed at the end of each result.
search_result_description is the style of the meta description, if available.
search_result_highlight is the style applied to any of the search keywords that appear in the result title, content, or description.
search_footer is the style of the links that go to the other results pages (i.e. previous page, next page, 5th page...).
search_time is the style of the displayed search execution time.
search_copyright is the style of the search engine copyright displayed at the bottom of the results page. It can be removed by editing the results template below.
3.2 Editing the results template
If you want even more control over the appearance of the search results page, you can edit the results template. Please note that this is much more difficult than editing the styles above. If any of your changes happen to cause problems that you are unable to fix, you can revert to the original style sheet and template by clicking on Restore Original Style Sheet and Template.
The default template is shown below. The lines in capital letters are explanatory annotations and are not part of the template:
IF THERE ARE RESULTS FOUND FOR THIS SEARCH...
<!-- IF [results] -->
<div class="search_header">
Displaying {FIRSTRESULT}-{LASTRESULT} of {TOTALRESULTS} result(s)</div>
<br />
START DISPLAYING EACH RESULT...
<!-- START results -->
<div class="search_result_title">
<a href="{results.URL}">{results.TITLE}</a>
</div>
<!-- IF [results.DESCRIPTION] -->
<div class="search_result_description">
{results.DESCRIPTION}
</div>
<!-- ENDIF -->
<div class="search_result_content">
{results.CONTENT}
</div>
<div class="search_result_url">
{results.URL}
</div>
<br />
<!-- END results -->
IF THERE IS MORE THAN ONE PAGE OF RESULTS...
<!-- IF [pages] -->
<div class="search_footer">
Results Page:
<!-- IF isset( [PREVIOUSPAGE] ) -->
<a href="{page->url}?id={ID}&query={URLQUERY}&offset={PREVIOUSPAGE}&results_per_page={RESULTSPERPAGE}"><< Previous</a>
<!-- ENDIF -->
START DISPLAYING LINKS TO EACH PAGE OF RESULTS...
<!-- START pages -->
<!-- IF [pages.PAGE] == [CURRENTPAGE] -->
{pages.PAGE}
<!-- ELSE -->
<a href="{page->url}?id={ID}&query={URLQUERY}&offset={pages.OFFSET}&results_per_page={RESULTSPERPAGE}">{pages.PAGE}</a>
<!-- ENDIF -->
<!-- END pages -->
<!-- IF isset( [NEXTPAGE] ) -->
<a href="{page->url}?id={ID}&query={URLQUERY}&offset={NEXTPAGE}&results_per_page={RESULTSPERPAGE}">Next >></a>
<!-- ENDIF -->
</div>
<!-- ENDIF-->
<br />
<div class="search_time">
Search took {EXECUTIONTIME} seconds
</div>
<!-- ELSE -->
No results were found
<!-- ENDIF -->
DISPLAY THE COPYRIGHT (can be removed if you wish)
<div class="search_copyright">Powered by <a href="http://www.heathcosoft.com/webproducts/searchsite/">Heathco Search Engine</a></div>
The <!-- (…) --> markers are not ordinary HTML comments like you might be familiar with, but syntax that the template engine interprets, telling it to evaluate conditions, run loops, output variables, etc.
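As a rough mental model of the variable part of this syntax, a placeholder such as {TOTALRESULTS} is replaced by a value when the page is rendered. The Python sketch below mimics only that substitution for simple upper-case names. It is not the engine's template code, it ignores the IF/START/END markers and dotted names such as {results.URL}, and the function name render is hypothetical.

```python
import re

# Simple placeholders only: an upper-case name wrapped in braces.
PLACEHOLDER_RE = re.compile(r"\{([A-Z]+)\}")


def render(template: str, values: dict) -> str:
    """Replace each {NAME} with values[NAME]; unknown names are left as-is."""
    def substitute(match):
        name = match.group(1)
        return str(values.get(name, match.group(0)))
    return PLACEHOLDER_RE.sub(substitute, template)


line = "Displaying {FIRSTRESULT}-{LASTRESULT} of {TOTALRESULTS} result(s)"
print(render(line, {"FIRSTRESULT": 1, "LASTRESULT": 10, "TOTALRESULTS": 42}))
# Displaying 1-10 of 42 result(s)
```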
4. General search engine settings
You can access the general settings through Settings on the menu. This allows you to configure everything from various aspects of the indexer to how relevancy is determined for results.
4.1 Indexer settings
Setting | Description
URLs Per Index | The number of URLs to index at a time when indexing through the web interface (not through the command line). A higher value might speed up indexing, but if it is set too high, the indexer might time out. A good range is 20 to 30.
Timeout | The amount of time the indexer will spend trying to connect to a URL. A good range is 5 to 30 seconds.
Max Read Size (Bytes) | The number of bytes to download from each individual URL when indexing. This might differ from the number of bytes stored in the database, but it determines how much of a page will be parsed for links.
Max Store Size (Bytes) | The number of bytes from each URL to store in the database. This is the searchable data (a larger value will slow searches down, but may improve result accuracy).
Stop Words | Words that are ignored in searches. They are common words that usually do not improve search results and instead degrade performance.
Allowed Content Types | The content types that will be allowed for indexing. text/html and text/plain are the two standard types that almost all web pages consist of. Restricting the allowed content types gives you more control over what gets indexed, preventing the engine from downloading images, etc.
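The Stop Words setting can be pictured as a filter applied to the query before it is matched. The Python sketch below is an illustration under assumptions: the word list is made up, and splitting on whitespace and lower-casing are choices made here, not documented engine behavior.

```python
# Hypothetical stop word list; the engine's actual list is set under Settings.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}


def remove_stop_words(query: str) -> list:
    """Return the query's words in lower case, with stop words dropped."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


print(remove_stop_words("The history of the search engine"))
# ['history', 'search', 'engine']
```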
When results are returned, they are returned in order of relevancy (how relevant a page is to your search). Relevancy is determined in this search engine by the page title, meta description, meta keywords, and page content. By increasing or decreasing the relevancy factors, you can easily change the order in which results are returned. The default settings seem to work best for most sites, but you might want to change these factors if, for example, your page titles are all the same.
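One way to picture relevancy factors is as weights in a weighted sum: each field contributes its keyword matches multiplied by that field's factor. The Python sketch below is an illustration only. The weights, the dictionary field names, and the scoring formula are all assumptions and not the engine's actual algorithm.

```python
# Hypothetical relevancy factors; real values are configured under Settings.
FACTORS = {"title": 5.0, "description": 3.0, "keywords": 2.0, "content": 1.0}


def score(page: dict, query: str, factors: dict = FACTORS) -> float:
    """Weighted count of query-word occurrences across the page's fields."""
    words = query.lower().split()
    total = 0.0
    for field, weight in factors.items():
        text = page.get(field, "").lower().split()
        total += weight * sum(text.count(word) for word in words)
    return total


pages = [
    {"url": "/a", "title": "Search engine manual", "content": "How to install."},
    {"url": "/b", "title": "Welcome", "content": "This search engine indexes pages."},
]
ranked = sorted(pages, key=lambda p: score(p, "search engine"), reverse=True)
print([p["url"] for p in ranked])
# ['/a', '/b']
```

Setting the title factor to zero in this sketch would rank /b first, which mirrors the case of a site whose page titles are all the same.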
Each result has a title, description, and content information. These are trimmed down from the original page so that multiple results can be shown at a time. By adjusting these character settings (a character is a letter/number/etc.), you can change how much information is displayed for the title and so forth.
5. Cronjobs (automating the index process)
You can run the indexer from the command line (and ultimately allow indexing to be completed by a cronjob) through the use of the search_index_cmd.php file. Note that you must be able to execute php from the command line to do this (please read the online PHP documentation; Heathco Software does not support setting this up).
To run the indexer from the command line, use the following command:
php search_index_cmd.php -id [search id#] -name [search name] -all
The search id# is the ID number of the search you want to index. The search name is the name of the search you want to index. You can use one or the other, depending on your preference (there is no point in using both). If you omit both -id and -name, then ALL searches will be indexed.
Specifying -all tells the indexer to re-index ALL URLs. If you leave this out (which you usually should), it will only index URLs that have been modified since the last index.
By using this command line, you can set up a cronjob to do the indexing task on a daily or weekly basis (or on whatever interval you wish).
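The rules for the options can be summarized in a small helper that assembles the command. This Python sketch is for illustration only: the function name build_index_command is hypothetical, and rejecting -id and -name together is a choice made here (the manual only says that using both is pointless). It merely builds the argument list, which a scheduled job could then run, for example with subprocess.run.

```python
def build_index_command(search_id=None, search_name=None, reindex_all=False):
    """Build the argument list for search_index_cmd.php.

    With neither search_id nor search_name, all searches are indexed.
    """
    if search_id is not None and search_name is not None:
        raise ValueError("use either search_id or search_name, not both")
    command = ["php", "search_index_cmd.php"]
    if search_id is not None:
        command += ["-id", str(search_id)]
    elif search_name is not None:
        command += ["-name", search_name]
    if reindex_all:
        command.append("-all")  # full re-index instead of modified URLs only
    return command


print(build_index_command(search_id=3))
print(build_index_command(search_name="example_search", reindex_all=True))
print(build_index_command())
```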