1
jdseymour
Search spiders

Anyone know why I would have 36 spiders on my site at one time?



Further checking in cPanel turned up 57 different IPs for inktomisearch.com. I believe that's Yahoo, right?
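
One way to confirm which search engine a spider belongs to is a reverse DNS lookup on the requesting IP. A minimal Python sketch; the address below is made up for illustration, not taken from any real log:

import socket

# Reverse DNS lookup on a crawler's IP address.
ip = "66.196.90.1"  # hypothetical address; substitute one from your logs
try:
    host, _, _ = socket.gethostbyaddr(ip)
    # Yahoo's Inktomi crawlers resolve to hosts under inktomisearch.com.
    print(host)
except socket.herror:
    print("no reverse DNS entry for", ip)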

2
JasonMR
Re: Search spiders
  • 2004/12/13 3:54

  • JasonMR

  • Just can't stay away

  • Posts: 655

  • Since: 2004/6/21


That can be normal...depending on your robots.txt settings. You will find this file in your site's root (document root) folder.

I've copied and pasted the following from Search Engine World (they have more info regarding spiders, bots, etc...):

Quote:

Robots.txt Tutorial
Search engines will look in your root domain for a special file named "robots.txt" (http://www.mydomain.com/robots.txt). The file tells the robot (spider) which files it may spider (download). This system is called the Robots Exclusion Standard.

The format for the robots.txt file is special. It consists of records. Each record consists of two fields: a User-agent line and one or more Disallow: lines. The format of each line is:

<field> ":" <value>

The robots.txt file should be created in Unix line-ending mode! Most good text editors will have a Unix mode, or your FTP client *should* do the conversion for you. Do not attempt to use an HTML editor that does not specifically have a text mode to create a robots.txt file.
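
If you generate the file from a script, you can force Unix line endings explicitly. A minimal Python sketch (the rules are just the sample ones discussed below):

# Write robots.txt with Unix ("\n") line endings regardless of platform;
# newline="\n" disables the platform's newline translation.
rules = "User-agent: *\nDisallow: /cgi-bin/\n"
with open("robots.txt", "w", newline="\n") as f:
    f.write(rules)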

User-agent

The User-agent line specifies the robot. For example:

User-agent: googlebot

You may also use the wildcard character "*" to specify all robots:

User-agent: *

You can find user agent names in your own logs by checking for requests to robots.txt. Most major search engines have short names for their spiders.
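
As an illustration of that log check, here is a short Python sketch; it assumes an Apache combined-format access log at the hypothetical path access.log:

import re
from collections import Counter

agents = Counter()
with open("access.log") as log:
    for line in log:
        # Only look at requests for robots.txt.
        if '"GET /robots.txt' in line:
            # The combined log format puts the user agent in the last quoted field.
            m = re.search(r'"([^"]*)"\s*$', line)
            if m:
                agents[m.group(1)] += 1

# Print the spiders seen, most frequent first.
for agent, hits in agents.most_common():
    print(hits, agent)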

Disallow:

The second part of a record consists of Disallow: directive lines. These lines specify files and/or directories. For example, the following line instructs spiders that they cannot download email.htm:

Disallow: email.htm

You may also specify directories:

Disallow: /cgi-bin/

This would block spiders from your cgi-bin directory.

There is a wildcard nature to the Disallow directive: it matches by prefix. The standard dictates that /bob would disallow /bob.html and /bob/index.html (both the file bob and files in the bob directory will not be indexed).

If you leave the Disallow line blank, it indicates that ALL files may be retrieved. At least one Disallow line must be present for each User-agent directive for the record to be correct. A completely empty robots.txt file is treated the same as if it were not present.
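
You can verify this prefix behavior with Python's standard robots.txt parser; the rules and paths here are illustrative:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /bob",
])

print(rp.can_fetch("*", "/bob.html"))        # False: blocked by the /bob prefix
print(rp.can_fetch("*", "/bob/index.html"))  # False: same prefix rule
print(rp.can_fetch("*", "/alice.html"))      # True: no rule matches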

White Space & Comments

Any line in the robots.txt file that begins with # is considered a comment only. The standard allows comments at the end of directive lines, but this is really bad style:

Disallow: bob #comment

Some spiders will not interpret the above line correctly and will instead attempt to disallow "bob#comment". The moral is to place comments on lines by themselves.

White space at the beginning of a line is allowed, but not recommended.

    Disallow: bob #comment

Examples

The following allows all robots to visit all files because the wildcard "*" specifies all robots.

User-agent: *
Disallow:

This one keeps all robots out.

User-agent: *
Disallow: /

The next one bars all robots from the cgi-bin and images directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/

This one bans Roverdog from all files on the server:

User-agent: Roverdog
Disallow: /

This one keeps googlebot from getting at the cheese.htm file:

User-agent: googlebot
Disallow: cheese.htm

For more complex examples, try retrieving some of the robots.txt files from big sites like CNN or LookSmart.
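
A quick way to do that retrieval, sketched in Python (any site's robots.txt URL will do; network access required):

from urllib import request

# Fetch and print a live robots.txt for study.
with request.urlopen("https://www.cnn.com/robots.txt") as resp:
    print(resp.read().decode("utf-8", errors="replace"))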

Extensions to the Standard
Although there have been proposed extensions to the standard, such as an Allow line or robot version control, there has been no formal endorsement by the Robots Exclusion Standard working group.


Hope this helps...though check Search Engine World if you want to learn more about the topic...

3
jdseymour
Re: Search spiders

Thanks, it was a big surprise when I came out of the admin section and saw the long list in xmemberstats.

I have had this site for several years and seen maybe 2 or 3 bots at a time, but never over 30. I will check the link; I believe my robots.txt is set right, but I will double-check.

4
danielh2o
Re: Search spiders
  • 2004/12/13 11:15

  • danielh2o

  • Just popping in

  • Posts: 47

  • Since: 2004/10/19


For general security's sake, can anyone recommend a list of files/directories to Disallow or Allow in robots.txt?
(e.g. Disallow mainfile.php!?)

I am not sure whether mainfile.php is secure. Chmodding it to 444 seems to cover filesystem access rights only.
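
For illustration only, here is a minimal robots.txt sketch for a stock XOOPS layout. The directory names assume a default install, so adjust them for your site, and keep in mind that robots.txt only steers well-behaved crawlers; it is not a security control:

User-agent: *
Disallow: /class/
Disallow: /include/
Disallow: /kernel/
Disallow: /language/
Disallow: /templates_c/
Disallow: /uploads/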

https://xoops.org/modules/newbb/viewtopic.php?topic_id=28453&forum=7&post_id=124075

Comments/discussion are welcome, so everyone can learn more about mainfile.php...

5
carnuke
Re: Search spiders
  • 2004/12/13 11:22

  • carnuke

  • Home away from home

  • Posts: 1955

  • Since: 2003/11/5


Quote:
Comments/discussion are welcome, so everyone can learn more about mainfile.php...


Take a look at the XOOPS FAQ in the security category.

6
jdseymour
Re: Search spiders

Sorry for resurrecting my old post, but I need an opinion on this.

This is the second month that my site has been open. Before XOOPS I had little to no traffic; now I get around 45 visitors a day, give or take.

These are the Googlebot and MSNBot summaries for the month of December.

Googlebot  3158+65   78.57 MB  31 Dec 2004 - 20:34
MSNBot     1712+218  39.34 MB  31 Dec 2004 - 23:49

These are listed as #hits + #robots.txt hits.

Is this normal, high, or low for Google and MSN?

Still new to this and learning, but my search engine hits are going up.

7
astonstreet
Re: Search spiders

Quote:

Googlebot  3158+65   78.57 MB  31 Dec 2004 - 20:34
MSNBot     1712+218  39.34 MB  31 Dec 2004 - 23:49

Those figures sound normal to me. My site is slightly less than 2 months old and I get a lot of my traffic from Google.

8
jdseymour
Re: Search spiders

Thanks for the reply.
