
Thursday, May 6, 2010

Understanding the Search Engine

The Goal of Search Engines
Many people think search engines have a hidden agenda. This simply is not true. The goal of a search engine is to provide high-quality content to people searching the Internet. The search engines with the broadest distribution networks sell the most advertising space. As I write this, Yahoo! and Google are considered the search engines with the best relevancy. Their technologies power the bulk of web search.

The Problem with Listing a New Site

The biggest problem new websites have is that search engines have no idea they exist. Even when a search engine finds a new document, it has a hard time determining its quality. Search engines rely on links to help determine the quality of a document. Some engines, such as Google, also trust websites more as they age. The following bits cover a few advanced search topics. It is fine if you do not understand them right away; the average webmaster does not need to know search technology in depth. Some might be interested in it, so I wrote a bit about it.

Parts of a Search Engine

While there are different ways to organize web content, every crawling search engine has the same basic parts. Each consists of:

• A crawler
• An index (or catalog)
• A search interface

Crawler (or Spider):

The crawler does just what its name implies: it scours the web following links, updating pages, and adding new pages when it comes across them. Each search engine has periods of deep crawling and periods of shallow crawling. There is also a scheduling mechanism to prevent a spider from overloading servers and to tell the spider which documents to crawl next and how frequently to crawl them.
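
As a rough illustration, here is a minimal breadth-first crawl loop in Python. The fetch_links() helper is a stand-in of my own, not any engine's actual code, and a real spider would also honor robots.txt, throttle its requests, and prioritize its queue:

from collections import deque
from urllib.parse import urljoin
import urllib.request
import re

def fetch_links(url):
    # Fetch a page and pull out href targets with a crude regex.
    # A production crawler would use a real HTML parser.
    try:
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    except Exception:
        return []
    return [urljoin(url, href) for href in re.findall(r'href="([^"#]+)"', html)]

def crawl(seed_urls, max_pages=100):
    # Breadth-first crawl: follow links, record each new page once.
    queue = deque(seed_urls)
    seen = set(seed_urls)
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)  # a real scheduler would prioritize here
    return seen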

Rapidly changing or highly important documents are more likely to get crawled frequently. The frequency of crawl typically has little effect on search relevancy; it simply helps the search engines keep fresh content in their indexes. The home page of CNN.com might get crawled once every 10 minutes. A popular, rapidly growing forum might get crawled a few dozen times each day. A static site with little link popularity and rarely changing content might only get crawled once or twice a month.

The biggest benefit of having a frequently crawled page is that you can get your new sites, pages, or projects crawled quickly by linking to them from a powerful or frequently changing page.

The Index

The index is where the data the spider collects is stored. When you perform a search on a major search engine, you are not searching the web, but the cache of the web provided by that search engine’s index.

Reverse Index

Search engines organize their content in what is called a “reverse index.” A reverse index sorts web documents by words. When you search Google and it displays results 1-10 of 143,000 websites, it means that there are approximately 143,000 web pages which either contain the search words or have inbound links containing them.

Search engines do not store punctuation, just words. The following example reverse index is overly simplified for clarity. Imagine each of the following sentences is the content of a unique page.

The dog ate the cat.
The cat ate the mouse.

Word    Document(s)   Position(s)
the     1, 2          doc 1: 1, 4; doc 2: 1, 4
dog     1             doc 1: 2
ate     1, 2          doc 1: 3; doc 2: 3
cat     1, 2          doc 1: 5; doc 2: 2
mouse   2             doc 2: 5
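
Here is a small Python sketch of how such a reverse index could be built from the two example pages above (positions are 1-based and punctuation is dropped, as described):

import re
from collections import defaultdict

docs = {1: "The dog ate the cat.", 2: "The cat ate the mouse."}

# Map each word to a list of (document, position) pairs.
index = defaultdict(list)
for doc_id, text in docs.items():
    words = re.findall(r"[a-z]+", text.lower())  # strip punctuation, keep words
    for position, word in enumerate(words, start=1):
        index[word].append((doc_id, position))

print(index["the"])  # [(1, 1), (1, 4), (2, 1), (2, 4)]
print(index["cat"])  # [(1, 5), (2, 2)]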

Stop Words

Words which are common do not help search engines understand documents. Exceptionally common terms, such as the, are called stop words. While search engines index stop words, they are not used to determine relevancy in search algorithms. If I search for the cat in the hat, search engines may insert wildcards for the words the and in, so my search will look like “* cat * * hat”.
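
A quick sketch of that wildcard behavior, using a tiny hand-picked stop list of my own (real engines derive their stop lists from corpus statistics):

STOP_WORDS = {"the", "in", "a", "an", "of", "and"}  # illustrative subset

def strip_stop_words(query):
    # Replace stop words with wildcards rather than deleting them,
    # mirroring the "* cat * * hat" behavior described above.
    return " ".join("*" if w in STOP_WORDS else w
                    for w in query.lower().split())

print(strip_stop_words("the cat in the hat"))  # * cat * * hat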

Index Normalization

Each page is standardized to a size. This prevents longer pages from having an unfair advantage by using a term many more times throughout long page copy. It also prevents short pages from scoring arbitrarily high by having a high percentage of their page copy composed of a few keyword phrases. Thus, there is no magical page copy length which is best for all search engines.

The uniqueness of page content is far more important than its length. The three best purposes for page copy are:

• To be unique enough to get indexed and ranked in the search results.
• To be interesting enough that people want to link at it.
• To convert site visitors to subscribers, buyers, or people who click on ads.

Not every page is going to make sales or be compelling enough to link at, but if, in aggregate, many of your pages are of high quality, over time that will help boost the rankings of every page on your site.

Term Frequency

Term frequency (TF) is a weighted measure of how often a term appears in a document. Terms which occur frequently within a document are thought to be among the more important terms for that document.
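
A simple Python sketch of length-normalized term frequency, which also shows the index normalization idea from above: dividing by document length keeps long pages from winning on raw repetition alone. Real engines use more elaborate weighting, but the basic shape is:

from collections import Counter

def term_frequency(text):
    # Count each word, then normalize by document length so that
    # long pages do not get an unfair advantage from raw repetition.
    words = text.lower().split()
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}

tf = term_frequency("the dog ate the cat")
print(tf["the"])  # 0.4  (2 occurrences out of 5 words)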

Search Interface

The search algorithm and search interface are used to find the most relevant documents in the index based on the search. First, the search engine tries to determine user intent by looking at the words the searcher types in. These terms can be stripped down to their root level (dropping ing and other suffixes) and checked against a lexical database to see what concepts they represent. Terms which are a near match will help you rank for other similar terms. For example, using the word swims could help you rank well for swim or swimming.
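
A crude suffix-stripping sketch along those lines; real engines use full stemmers such as the Porter stemmer plus a lexical database, so treat this as illustration only:

def crude_stem(word):
    # Strip a few common suffixes to reach an approximate root form.
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant left behind (swimm -> swim).
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

print(crude_stem("swims"))     # swim
print(crude_stem("swimming"))  # swim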

Search engines can try to match keyword vectors against each of the specific terms in a query, or try to match related concepts against the search query as a phrase if the words in the query are seen as part of a larger conceptual unit.
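
One common way to score that kind of match is cosine similarity between a query vector and a document’s term vector. This Python sketch uses plain term counts as weights, which is a simplification of what real engines do:

import math
from collections import Counter

def cosine_similarity(a, b):
    # Compare two word-count vectors; 1.0 means identical direction.
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc = Counter("the cat ate the mouse".split())
query = Counter("cat mouse".split())
print(round(cosine_similarity(doc, query), 3))  # ~0.535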