Have you ever wanted a search engine on your site but didn't know how to do it? ~ TwinrainbO

Have you ever wanted a search engine on your site but didn't know how to do it? If so, read this article and see for yourself how Daniel Solin creates a search engine using PHP.

Database Design And Logic

The database for the search engine consists of three tables: page, word and occurrence. page holds all web pages that have been indexed, and word holds all words that have been found on those pages. Each row in occurrence references one row in page and one row in word, and represents one occurrence of one particular word on one particular page. The SQL for creating these tables is shown below.

CREATE TABLE page (
    page_id int(10) unsigned NOT NULL auto_increment,
    page_url varchar(200) NOT NULL default '',
    PRIMARY KEY (page_id)
) ENGINE=MyISAM;

CREATE TABLE word (
    word_id int(10) unsigned NOT NULL auto_increment,
    word_word varchar(50) NOT NULL default '',
    PRIMARY KEY (word_id)
) ENGINE=MyISAM;

CREATE TABLE occurrence (
    occurrence_id int(10) unsigned NOT NULL auto_increment,
    word_id int(10) unsigned NOT NULL default '0',
    page_id int(10) unsigned NOT NULL default '0',
    PRIMARY KEY (occurrence_id)
) ENGINE=MyISAM;

As you can see, while page and word hold actual data, occurrence acts only as a reference table. By joining occurrence with page and word, we can find out which pages mention a certain word, as well as how many times each page mentions it. We'll get back to that a little later; first we need to populate the database so that we have some content to work with.
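To make the join concrete before any real data exists, here is a minimal sketch that models the three tables as plain PHP arrays instead of SQL. The sample rows and the function name pages_for_word() are made up for illustration and are not part of the original scripts.

```php
<?php
// Hypothetical sample rows mirroring the three tables (not real data).
$page = [1 => 'http://example.com/a', 2 => 'http://example.com/b'];
$word = [1 => 'linux', 2 => 'php'];
$occurrence = [
    ['word_id' => 1, 'page_id' => 1],
    ['word_id' => 1, 'page_id' => 1],
    ['word_id' => 2, 'page_id' => 1],
    ['word_id' => 1, 'page_id' => 2],
];

// Emulates joining occurrence with page and word: returns url => count
// for one word, highest count first.
function pages_for_word(string $needle, array $page, array $word, array $occurrence): array
{
    $word_id = array_search($needle, $word, true);
    if ($word_id === false) {
        return []; // the word was never indexed
    }
    $counts = [];
    foreach ($occurrence as $row) {
        if ($row['word_id'] === $word_id) {
            $url = $page[$row['page_id']];
            $counts[$url] = ($counts[$url] ?? 0) + 1;
        }
    }
    arsort($counts); // sort by count, descending, keeping the URLs as keys
    return $counts;
}

print_r(pages_for_word('linux', $page, $word, $occurrence));
```

Each element of $occurrence stands for one row in the occurrence table, so counting the matching rows per page gives the same figure the SQL join reports later on.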

Populating The Database

Okay, the database is created and we're ready to feed it with some content. For this, we'll create a PHP-script that takes a user-specified URL, reads the document representing the URL, and creates records in the database based on the words it extracts from the document. Take a look at the listing below.

<?php
/*
* populate.php
*
* Script for populating the search-database with words,
* pages and word-occurrences.
*/
/* Connect to the database: */
$db = mysqli_connect("localhost","root","secret","test")
    or die("ERROR: Could not connect to database!");
/* Define the URL that should be processed: */
$url = isset($_GET['url']) ? trim($_GET['url']) : '';
if( $url == '' )
{
    die( "You need to define a URL to process." );
}
else if( !preg_match('#^https?://#i', $url) )
{
    $url = "http://$url";
}
/* Does this URL already have a record in the page-table? */
$url_sql = mysqli_real_escape_string($db, $url);
$result = mysqli_query($db, "SELECT page_id FROM page WHERE page_url = '$url_sql'");
$row = mysqli_fetch_assoc($result);
if( $row )
{
    /* If yes, use the old page_id: */
    $page_id = $row['page_id'];
}
else
{
    /* If not, create one: */
    mysqli_query($db, "INSERT INTO page (page_url) VALUES ('$url_sql')");
    $page_id = mysqli_insert_id($db);
}
/* Start parsing through the text, and build an index in the database: */
$fd = fopen($url,"r")
    or die("ERROR: Could not open the URL!");
while( ($buf = fgets($fd,1024)) !== false )
{
    /* Remove whitespace from beginning and end of string: */
    $buf = trim($buf);
    /* Try to remove all HTML-tags and entities such as &amp;: */
    $buf = strip_tags($buf);
    $buf = preg_replace('/&#?\w+;/', ' ', $buf);
    /* Extract all words matching the regexp from the current line.
       $words[0] holds the complete matches: */
    preg_match_all('/\b\w+\b/', $buf, $words);
    /* Loop through all words/occurrences and insert them into the database: */
    foreach( $words[0] as $word )
    {
        /* Does the current word already have a record in the word-table?
           (\w matches only letters, digits and underscores, and word_word
           holds at most 50 characters.) */
        $cur_word = substr(strtolower($word), 0, 50);
        $result = mysqli_query($db, "SELECT word_id FROM word WHERE word_word = '$cur_word'");
        $row = mysqli_fetch_assoc($result);
        if( $row )
        {
            /* If yes, use the old word_id: */
            $word_id = $row['word_id'];
        }
        else
        {
            /* If not, create one: */
            mysqli_query($db, "INSERT INTO word (word_word) VALUES ('$cur_word')");
            $word_id = mysqli_insert_id($db);
        }
        /* And finally, register the occurrence of the word: */
        mysqli_query($db, "INSERT INTO occurrence (word_id,page_id) VALUES ($word_id,$page_id)");
        print "Indexing: $cur_word<br>\n";
    }
}
fclose($fd);
?>

Basically, this script connects to the database, registers the URL (the page) in the database if it's not already there, and starts to retrieve data. It uses the preg_match_all() function to extract the words from the page, and then creates a record in the occurrence table, and if needed in the word table, for each word it processes. So, for example, if the script finds the word 'linux' on http://www.onlamp.com, it will execute the following INSERT statements:
INSERT INTO page (page_url) VALUES ("http://www.onlamp.com");
INSERT INTO word (word_word) VALUES ("linux");
INSERT INTO occurrence (word_id,page_id) VALUES ($word_id,$page_id);
However, this is only true if http://www.onlamp.com has not been indexed yet and this occurrence of 'linux' is the first one. If 'linux' occurs once more further down on the page, the first two statements will not be executed, and the existing page_id and word_id will be used again.
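The word-extraction step can be tried on its own, without a database. The following is a minimal sketch, assuming the same tag stripping, entity removal and word regexp as described above. The helper names extract_words() and count_occurrences() are hypothetical and do not appear in populate.php.

```php
<?php
// Turns one line of HTML into a list of lowercase words.
// extract_words() is a hypothetical helper, not a function from the article.
function extract_words(string $line): array
{
    $line = strip_tags(trim($line));                 // drop HTML tags
    $line = preg_replace('/&#?\w+;/', ' ', $line);   // drop entities like &amp;
    preg_match_all('/\b\w+\b/', $line, $matches);    // $matches[0] holds whole matches
    return array_map('strtolower', $matches[0]);
}

// Counts how often each word occurs, i.e. how many occurrence rows
// an indexer would insert per word for this line.
function count_occurrences(string $line): array
{
    return array_count_values(extract_words($line));
}

print_r(count_occurrences('<p>Linux &amp; PHP: <b>linux</b> rocks</p>'));
```

Here 'linux' is counted twice, which corresponds to one row in word and two rows in occurrence for that page.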
Let's now index a few pages. The seven sites that make up the O'Reilly Network are a good place to start. Call populate.php from your browser with each site URL as the only argument, one at a time:
http://localhost/populate.php?url=http://www.macdevcenter.com
http://localhost/populate.php?url=http://www.onjava.com
http://localhost/populate.php?url=http://www.onlamp.com
http://localhost/populate.php?url=http://www.openp2p.com
http://localhost/populate.php?url=http://www.osdir.co
http://localhost/populate.php?url=http://www.perl.com
http://localhost/populate.php?url=http://www.xml.com
A quick investigation of the tables now should result in something like this:
mysql> SELECT * FROM page;

The Search Interface
Of course, users of the search engine will not be able to work with the MySQL database directly. Therefore, we'll create another PHP script that makes it possible to query the database through an HTML form. This will work just like any other search engine: the user enters a word in a text box, hits Enter, and the interface presents a result page with links to the pages that contain the word that was searched for. In this example, the pages are ordered by the number of times the keyword appears in each document. The search.php script is listed below.
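<?php
/*
* search.php
*
* Script for searching a database populated with keywords by the
* populate.php-script.
*/
print "<html><body>\n";
if( !empty($_POST['keyword']) )
{
    /* Connect to the database: */
    $db = mysqli_connect("localhost","root","secret","test")
        or die("ERROR: Could not connect to database!");
    /* Escape the keyword and force the result limit to a positive integer: */
    $keyword = mysqli_real_escape_string($db, strtolower(trim($_POST['keyword'])));
    $limit = isset($_POST['results']) ? max(1, (int)$_POST['results']) : 10;
    /* Get timestamp before executing the query: */
    $start_time = getmicrotime();
    /* Execute the query that performs the actual search in the DB: */
    $result = mysqli_query($db, " SELECT
                                p.page_url AS url,
                                COUNT(*) AS occurrences
                            FROM
                                page p,
                                word w,
                                occurrence o
                            WHERE
                                p.page_id = o.page_id AND
                                w.word_id = o.word_id AND
                                w.word_word = '$keyword'
                            GROUP BY
                                p.page_id
                            ORDER BY
                                occurrences DESC
                            LIMIT $limit" );
    /* Get timestamp when the query is finished: */
    $end_time = getmicrotime();
    /* Present the search-results (escaped, since they end up in HTML): */
    $safe_keyword = htmlspecialchars($_POST['keyword'], ENT_QUOTES);
    print "<h2>Search results for '$safe_keyword':</h2>\n";
    for( $i = 1; $row = mysqli_fetch_assoc($result); $i++ )
    {
        $safe_url = htmlspecialchars($row['url'], ENT_QUOTES);
        print "$i. <a href='$safe_url'>$safe_url</a>\n";
        print "(occurrences: ".$row['occurrences'].")<br><br>\n";
    }
    /* Present how long it took to execute the query: */
    print "query executed in ".number_format($end_time-$start_time, 3)." seconds.";
}
else
{
    /* If no keyword is defined, present the search-page instead.
       This is a minimal form; the field names match the $_POST keys used above: */
    print "<form method='post'>\n";
    print "Keyword: <input type='text' name='keyword' size='20'>\n";
    print "Results: <select name='results'>";
    print "<option value='5'>5</option><option value='10'>10</option><option value='15'>15</option>";
    print "</select>\n";
    print "<input type='submit' value='Search'>\n";
    print "</form>\n";
}
print "</body></html>\n";
/* Simple function for retrieving the current timestamp in seconds,
   with microsecond precision: */
function getmicrotime()
{
    return microtime(true);
}
?>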
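Because the keyword and the number of results both come straight from the form, it can help to normalize them before they reach the query. The sketch below shows one possible way to do that. normalize_keyword() and clamp_limit() are hypothetical helpers, and the default of 10 and the maximum of 50 are arbitrary assumptions rather than values from the article.

```php
<?php
// Hypothetical helper: lowercases and trims the keyword so that it matches
// the lowercase words stored by the indexing script.
function normalize_keyword($raw): string
{
    return strtolower(trim((string)$raw));
}

// Hypothetical helper: turns the requested result count into an integer
// between 1 and $max, falling back to $default for non-numeric input.
function clamp_limit($raw, int $default = 10, int $max = 50): int
{
    if (!is_numeric($raw)) {
        return $default;
    }
    return max(1, min($max, (int)$raw));
}

echo normalize_keyword("  Linux "), "\n";
echo clamp_limit("5"), "\n";
echo clamp_limit("5; DROP TABLE page"), "\n";
echo clamp_limit("100000"), "\n";
```

The clamped value is always a plain integer, so it can be placed after LIMIT without any further quoting. The keyword still needs escaping or a bound parameter before it goes into the WHERE clause.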

for details :
http://www.devarticles.com/c/a/HTML/Building-A-Search-Engine/