OK, so you've checked your log files and there are always these funny entries that have the "user agent" (the application used to access the website, e.g. Mozilla Firefox, Opera or Internet Explorer) set to "spider" or "robot" or "googlebot" or "crawler". What in the world?! It's sucking 8GB off my site every month! Who gave it permission? How'd it get here? Who do I report this to?
Whoa! Calm down, Tonto. It's a search engine crawler reading your page so it can add it to its index.
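If you'd like to spot these visitors in your own logs, here is a minimal Python sketch. It assumes your server writes the user agent as the last double-quoted field on each line (as the common "combined" log layout does), and the function names and the list of hints are made up for illustration. Real crawlers go by many more names than the four mentioned above.

```python
import re

# Substrings taken from the examples above; this list is illustrative,
# not a complete catalogue of crawler names.
CRAWLER_HINTS = ("spider", "robot", "googlebot", "crawler")

# Assumption: the user agent is the last double-quoted field on the line.
LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')

def user_agent(log_line):
    """Return the last quoted field of a log line, or None if there is none."""
    match = LAST_QUOTED.search(log_line)
    return match.group(1) if match else None

def is_crawler(log_line):
    """Guess whether a log line came from a crawler, based on its user agent."""
    agent = user_agent(log_line)
    if agent is None:
        return False
    agent = agent.lower()
    return any(hint in agent for hint in CRAWLER_HINTS)
```

Run is_crawler over every line of a log file and count the hits to get a rough idea of how much of your traffic is crawlers.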
(First off, this is the first article where only members can access the complete text.)
We're going to cover the following topics in this article:
1. Crawlers - What are they? (Accessible)
2. Search Engine Indexes - What are they? (Accessible)
3. How do crawlers access my website / pages? (Need to log in to read)
4. What do they access on my pages? (Need to log in to read)
5. Where do crawlers get their lists from? (Need to log in to read)
Crawlers - What are they?
Well, crawlers are the second-highest bandwidth users on the internet (or the third, behind downloads), with SPAM being the highest. That's my opinion, anyway. Basically, they access your website and download the written text, which comes wrapped in the under-the-bonnet HTML & CSS, and usually nothing else. That source typically looks something like this:
<html>
<head>
<meta tags>
<description tags>
<title tags>
<script or two>
<linked in css style sheet or two>
<other stuff>
</head>
<body>
your writing goes here... rhubarb rhubarb, gurgle, talk talk... <some images> <a table or two> <some headings> etc.
</body>
</html>
The crawler usually ignores the different style sheets, scripts and stuff in the <head> section. It is primarily interested in the "KEYWORDS" and "DESCRIPTION" meta tags and the "TITLE" tag. These tell the crawler a lot about the page and give it context. (Tip: make sure your titles, descriptions and keywords are relevant to the content.)
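To make that concrete, here is a rough Python sketch of pulling the title, keywords and description out of a page using the standard library's html.parser module. The class and function names are my own invention for illustration, and no real search engine is claimed to work exactly this way.

```python
from html.parser import HTMLParser

class HeadInfoParser(HTMLParser):
    """Collects the <title> text and the keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            # Meta names are matched case-insensitively ("Keywords", "KEYWORDS", ...).
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only text inside <title> is kept; scripts and styles are ignored.
        if self.in_title:
            self.title += data

def head_info(html):
    """Return a dict with the title plus any keywords/description found."""
    parser = HeadInfoParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), **parser.meta}
```

Feed it the source of one of your own pages and you'll see roughly what a crawler has to go on before it even reaches the body.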
Now, when it's done memorising these, it moves on to index the content of the page. The content is the stuff between the <body> ... </body> tags. Most crawlers (but not all) ignore everything that is not written text, e.g. scripts, images, Flash, videos, etc. The bigger ones check the images and other things, too. But let's stick to the general stuff.
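As a rough illustration of "keep the writing, skip the rest", the Python sketch below collects only the text between the <body> tags and drops anything inside <script> or <style>. Images and other media produce no text, so they fall away on their own. The names BodyTextParser and body_text are hypothetical, and real crawlers are far more thorough than this.

```python
from html.parser import HTMLParser

class BodyTextParser(HTMLParser):
    """Gathers the written text inside <body>, skipping scripts and styles."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is inside <body> and outside skipped tags.
        if self.in_body and self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def body_text(html):
    """Return the visible body text as one space-separated string."""
    parser = BodyTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```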
OK, so now it's extracted the writing from the page. It then processes this writing according to different rules. Don't ask me what these are, because every search engine has built its own rules. Some engines throw away noise words like this, that, I, we, and, if, of, etc. Some index them. The bigger ones index phrases like "what do you think about gherkins" or "why is makemoneyonline.co.za such an awesome site?". Yes, some even index the question mark! This information is then stored in their index, which is what the searches are run against.
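Since every engine keeps its rules to itself, the following Python sketch is only a guess at the simplest version of this step: split the text into lowercase words and optionally throw away the noise words listed above. The function names and the tiny noise-word list are assumptions for illustration.

```python
import re

# The noise words named in this article; real engines use their own lists.
NOISE_WORDS = {"this", "that", "i", "we", "and", "if", "of"}

def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def index_terms(text, drop_noise=True):
    """Return the words an engine might keep for its index."""
    words = tokenize(text)
    if drop_noise:
        words = [word for word in words if word not in NOISE_WORDS]
    return words
```

Calling index_terms with drop_noise=False mimics the engines that index the noise words too. An engine that indexes whole phrases would also need to remember the order of the words, which this sketch does not attempt.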
(Tip: Some search engines ignore pages under a certain size. They classify them as non-informative and sometimes even penalise the site.)
Search Engine Indexes - What are they?
Here, I'm going to borrow from one of my favourite sites, Wikipedia:
Search engine indexing collects, parses, and stores data to facilitate fast and
accurate information retrieval. Index design incorporates interdisciplinary
concepts from linguistics, cognitive psychology, mathematics, informatics, physics
and computer science. An alternate name for the process in the context of search
engines designed to find web pages on the Internet is Web indexing.
In a nutshell? It's almost like the index at the back of a normal book. You know, the section where you can look up a word or concept and it tells you which page to go and look on. Almost like that, except much bigger and much more complex.
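The back-of-the-book idea can be sketched in a few lines of Python. This is a toy version, assuming each page is just a name and a piece of text. The names build_index and search are made up for illustration, and a real search engine index stores far more than this (word positions, rankings and so on).

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each word to the sorted list of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return {word: sorted(names) for word, names in index.items()}

def search(index, query):
    """Return the pages that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    results = set(index.get(words[0], []))
    for word in words[1:]:
        results &= set(index.get(word, []))
    return sorted(results)
```

Just as with a book, the lookup never rereads the pages themselves. It only consults the word list, which is what makes searching fast.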
To read the rest, you're going to have to log in.