1
jdseymour
Search spiders

Anyone know why I would have 36 spiders on my site at one time?



Further checking in cPanel counted 57 different IPs for inktomisearch.com. I believe that's Yahoo, right?

2
JasonMR
Re: Search spiders
  • 2004/12/13 3:54

  • JasonMR

  • Just can't stay away

  • Posts: 655

  • Since: 2004/6/21


That can be normal... depending on your robots.txt settings. You will find this file in your site's root folder.

I've copied and pasted the following from Search Engine World (they have more info regarding spiders, bots, etc.):

Quote:

Robots.txt Tutorial
Search engines will look in your root domain for a special file named "robots.txt" (http://www.mydomain.com/robots.txt). The file tells the robot (spider) which files it may spider (download). This system is called the Robots Exclusion Standard.

The format for the robots.txt file is special. It consists of records. Each record consists of two fields: a User-agent line and one or more Disallow: lines. The format is:

<Field> ":" <value>

The robots.txt file should be created in Unix line-ending mode! Most good text editors will have a Unix mode, or your FTP client *should* do the conversion for you. Do not attempt to create a robots.txt file with an HTML editor that does not specifically have a text mode.
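
As a side note, here is a minimal Python sketch for forcing Unix (LF) line endings on an existing robots.txt (the file name and location are assumptions; point it at your own copy):

# rewrite robots.txt in place with Unix (LF) line endings
with open("robots.txt", "rb") as f:
    data = f.read()
with open("robots.txt", "wb") as f:
    f.write(data.replace(b"\r\n", b"\n").replace(b"\r", b"\n"))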

User-agent

The User-agent line specifies the robot. For example:

User-agent: googlebot

You may also use the wildcard character "*" to specify all robots:

User-agent: *

You can find user agent names in your own logs by checking for requests to robots.txt. Most major search engines have short names for their spiders.
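
For illustration, a minimal Python sketch that pulls the user-agent names out of an Apache combined-format access log might look like this (the file name access.log and the log format are assumptions; adjust them for your server):

import re

# match requests for /robots.txt and capture the quoted user-agent field
pattern = re.compile(r'"(?:GET|HEAD) /robots\.txt[^"]*"[^"]*"[^"]*" "([^"]*)"')

agents = set()
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            agents.add(match.group(1))

for agent in sorted(agents):
    print(agent)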

Disallow:

The second part of a record consists of Disallow: directive lines. These lines specify files and/or directories. For example, the following line instructs spiders that they may not download email.htm:

Disallow: email.htm

You may also specify directories:

Disallow: /cgi-bin/

This would block spiders from your cgi-bin directory.

There is a wildcard nature to the Disallow directive. The standard dictates that /bob would disallow /bob.html and /bob/index.html (both the file bob.html and files in the bob directory will not be indexed).
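
Written out as a record, that example would be:

User-agent: *
Disallow: /bob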

If you leave the Disallow line blank, it indicates that ALL files may be retrieved. At least one Disallow line must be present for each User-agent directive to be correct. A completely empty robots.txt file is treated the same as if it were not present.

White Space & Comments

Any line in the robots.txt that begins with # is considered to be a comment only. The standard allows for comments at the end of directive lines, but this is really bad style:

Disallow: bob #comment

Some spiders will not interpret the above line correctly and will instead attempt to disallow "bob#comment". The moral is to place comments on lines by themselves.
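
For example (the path here is made up for illustration):

# keep robots out of the temporary area
Disallow: /tmp/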

White space at the beginning of a line is allowed, but not recommended.

     Disallow: bob #comment

Examples

The following allows all robots to visit all files because the wildcard "*" specifies all robots.

User-agent: *
Disallow:

This one keeps all robots out.

User-agent: *
Disallow: /

The next one bars all robots from the cgi-bin and images directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/

This one bans Roverdog from all files on the server:

User-agent: Roverdog
Disallow: /

This one keeps googlebot from getting at the cheese.htm file:

User-agent: googlebot
Disallow: cheese.htm

For more complex examples, try retrieving some of the robots.txt files from big sites like CNN or LookSmart.
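
As a purely illustrative sketch (the robot names and paths below are made up), a larger file with several records might look like this:

User-agent: googlebot
Disallow: /cgi-bin/
Disallow: /tmp/

User-agent: Roverdog
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/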

Extensions to the Standard
Although there have been proposed extensions to the standard, such as an Allow line or robot version control, there has been no formal endorsement by the Robots Exclusion Standard working group.
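
For illustration only, an Allow line is typically written as below. It is not part of the formal standard, though Google documents support for it in its own crawler; other robots may ignore it.

User-agent: googlebot
Disallow: /private/
Allow: /private/readme.htm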


Hope this helps... though visit Search Engine World if you want to learn more about the topic...

3
jdseymour
Re: Search spiders

Thanks, it was a big surprise when I came out of the admin section and saw the long list in xmemberstats.

I have had this site for several years and seen maybe 2 or 3 bots at a time, but never over 30. I will check the link; I believe my robots.txt is set right, but I will double-check.

4
danielh2o
Re: Search spiders
  • 2004/12/13 11:15

  • danielh2o

  • Just popping in

  • Posts: 47

  • Since: 2004/10/19


For general security's sake, can anyone recommend a list of files/directories to Disallow or Allow in robots.txt?
(e.g. Disallow: mainfile.php!?)

I doubt whether mainfile.php is secure.
A chmod to 444 seems to cover filesystem access rights only.

https://xoops.org/modules/newbb/viewtopic.php?topic_id=28453&forum=7&post_id=124075

Comments/discussion are welcome so that everyone can learn more about mainfile.php...

5
carnuke
Re: Search spiders
  • 2004/12/13 11:22

  • carnuke

  • Home away from home

  • Posts: 1955

  • Since: 2003/11/5


Quote:
Comments/discussion are welcome so that everyone can learn more about mainfile.php...


Take a look at the XOOPS FAQ in the security category.

6
jdseymour
Re: Search spiders

Sorry for resurrecting my old post, but I need an opinion on this.

This is the second month that my site has been open; before XOOPS I had little to no traffic. Now with XOOPS I get around 45 visitors a day, give or take.

These are the Googlebot and MSNBot summaries for the month of December.

Googlebot 3158+65 78.57 MB 31 Dec 2004 - 20:34
MSNBot 1712+218 39.34 MB 31 Dec 2004 - 23:49

These are listed as #hits + #robots.txt hits.

Is this normal, high, or low for Google and MSN?

Still new to this and learning, but my search engine hits are going up.

7
astonstreet
Re: Search spiders

Quote:

Googlebot 3158+65 78.57 MB 31 Dec 2004 - 20:34
MSNBot 1712+218 39.34 MB 31 Dec 2004 - 23:49

Those figures sound normal to me. My site is slightly less than 2 months old and I get a lot of my traffic from Google.

8
jdseymour
Re: Search spiders

Thanks for the reply.
