1
peterr
MyLinks seems to send Googlebot into a spin ??
  • 2004/9/29 7:11

  • peterr

  • Just can't stay away

  • Posts: 518

  • Since: 2004/8/5 9


Hi,

Looking at the web server error log today, I found quite a few 404s from one IP address. I checked it out and it was 66.249.65.200, which resolved to crawl-66-249-65-200.googlebot.com.

Here is a small snippet of the error logs, in time order:

Quote:

/modules/mylinks/catalog.shtml
/404.shtml
/modules/mylinks/Sponsrps-mfga-colour04.doc
/404.shtml
/modules/mylinks/products.shtml
/404.shtml
/modules/mylinks/info-faq.shtml
/404.shtml
/modules/mylinks/pricing-au.shtml
/404.shtml
/modules/mylinks/order.shtml
/404.shtml
/modules/mylinks/contactus.shtml


None of these files exists _anywhere_ on my XOOPS website, but I recognised some of the filenames from other websites, so I decided to investigate the raw access logs. I don't know how the links end up containing my domain name followed by a filename that can only be found by actually visiting a site listed in the MyLinks module ??

Googlebot only gets the 404s from MyLinks under certain circumstances, though. An earlier crawl returned all '200' (OK) responses, as follows:

Quote:

/modules/mylinks/ratelink.php?lid=3
/modules/mylinks/brokenlink.php?lid=1
/modules/mylinks/modlink.php?lid=2
/user.php?xoops_redirect=%2Fmodules%2Fmylinks%2Fmodlink.php%3Flid%3D2
/modules/mylinks/modlink.php?lid=3
/user.php?xoops_redirect=%2Fmodules%2Fmylinks%2Fmodlink.php%3Flid%3D3
/robots.txt
/modules/mylinks/modlink.php?lid=1
/user.php?xoops_redirect=%2Fmodules%2Fmylinks%2Fmodlink.php%3Flid%3D1
/modules/mylinks/singlelink.php?lid=7
/modules/mylinks/singlelink.php?lid=1
/modules/mylinks/viewcat.php?op=&cid=2


However, a later crawl by Googlebot returned all the 404s, and they seemed to appear only after it had accessed viewcat.php. Here is what happened:

Quote:

"GET /robots.txt HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=hitsA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=hitsA HTTP/1.1" 200
"GET /modules/mylinks/contactus.shtml HTTP/1.1" 404
"GET /modules/mylinks/order.shtml HTTP/1.1" 404
"GET /modules/mylinks/viewcat.php?cid=2&orderby=dateD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=titleA HTTP/1.1" 200
"GET /modules/mylinks/pricing-au.shtml HTTP/1.1" 404
"GET /modules/mylinks/viewcat.php?cid=2&orderby=ratingA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=titleA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=dateA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=ratingD HTTP/1.1" 200
"GET /modules/mylinks/info-faq.shtml HTTP/1.1" 404
"GET /modules/mylinks/products.shtml HTTP/1.1" 404
"GET /modules/mylinks/viewcat.php?cid=3&orderby=hitsD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?op=&cid=3 HTTP/1.1" 200
"GET /modules/mylinks/Sponsrps-mfga-colour04.doc HTTP/1.1" 404
"GET /modules/mylinks/catalog.shtml HTTP/1.1" 404
"GET /modules/mylinks/viewcat.php?cid=3&orderby=ratingA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=titleD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=titleD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=dateA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=dateD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=hitsD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=ratingD HTTP/1.1" 200

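The pattern in the excerpt above can be pulled out with a few lines of Python. This is only a minimal sketch: it assumes the trimmed line format quoted in this post (request in double quotes, then the status code), not a full Apache log line, and the function name `split_by_status` is made up for illustration.

```python
import re

# Matches the trimmed format quoted in the post: "METHOD PATH PROTOCOL" STATUS
LINE_RE = re.compile(
    r'^"(?P<method>[A-Z]+) (?P<path>\S+) (?P<proto>[^"]+)" (?P<status>\d{3})$'
)

def split_by_status(lines):
    """Return (ok_paths, missing_paths) for 200 and 404 responses."""
    ok, missing = [], []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not in the expected format
        if match["status"] == "200":
            ok.append(match["path"])
        elif match["status"] == "404":
            missing.append(match["path"])
    return ok, missing

# A few lines copied from the excerpt above
sample = [
    '"GET /robots.txt HTTP/1.1" 200',
    '"GET /modules/mylinks/viewcat.php?cid=2&orderby=hitsA HTTP/1.1" 200',
    '"GET /modules/mylinks/contactus.shtml HTTP/1.1" 404',
    '"GET /modules/mylinks/order.shtml HTTP/1.1" 404',
]

ok, missing = split_by_status(sample)
for path in missing:
    print("404:", path)
```

Run over the whole excerpt, every 404 is a non-PHP filename directly under /modules/mylinks/, while every viewcat.php request returns 200.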

Can anyone explain what might be happening? If I click on one of the links, it resolves okay, but Googlebot seems to have followed the link, for example:

http://www.jehoshua.net/modules/mylinks/visit.php?lid=1

and then somehow still thought that it was on the domain jehoshua.net, because some of the files on the site that is linked to are:

info-faq.shtml
products.shtml
etc., etc.

Is this a problem with Googlebot, or with the MyLinks module (I can't see how, because I have tested the URLs requested just before the 404 messages, and they are okay), or is it because I have _also_ put the complete URL of each website in the (MyLinks) description ??
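One possible mechanism, offered as an assumption rather than a diagnosis: if any text shown on a viewcat.php page contains a relative link (for example `contactus.shtml` copied from the linked site), a crawler resolves it against the URL of the page it is reading, which would produce exactly the paths seen in the 404s. The sketch below uses Python's standard `urllib.parse.urljoin` to show that resolution; the `example.com` URL is a placeholder, not one of the real linked sites.

```python
from urllib.parse import urljoin

# The page Googlebot was reading when the 404s started (taken from the log)
base = "http://www.jehoshua.net/modules/mylinks/viewcat.php?cid=2&orderby=hitsA"

def resolve(href):
    """Resolve an href the way a crawler would, relative to the current page."""
    return urljoin(base, href)

# A relative href is resolved against the page's own directory
print(resolve("contactus.shtml"))

# An absolute href (placeholder domain) is left untouched
print(resolve("http://example.com/contactus.shtml"))
```

If this is what is happening, the relative form would explain why the requests land under /modules/mylinks/ on this domain, while an absolute URL would send the crawler to the other site.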

Msnbot didn't seem to have any trouble going through the /modules/mylinks path and following the links, although it didn't seem to 'dig' as deep as Google.

Should I 'disallow' the file /modules/mylinks/viewcat.php in robots.txt, just as a temporary measure, in the hope that crawlers/spiders will follow the rules ??
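As a rough check of what such a rule would cover, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content is an assumed example of the temporary rule being considered, not the site's actual file, and it only describes how a rule-following crawler would behave.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content for the temporary measure discussed above
ROBOTS_TXT = """\
User-agent: *
Disallow: /modules/mylinks/viewcat.php
"""

def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
site = "http://www.jehoshua.net"

# Rules are prefix matches, so every viewcat.php?... variant is covered
viewcat_allowed = parser.can_fetch(
    "Googlebot", site + "/modules/mylinks/viewcat.php?cid=2&orderby=hitsA"
)
singlelink_allowed = parser.can_fetch(
    "Googlebot", site + "/modules/mylinks/singlelink.php?lid=7"
)
print("viewcat.php allowed:", viewcat_allowed)
print("singlelink.php allowed:", singlelink_allowed)
```

The rule would stop compliant crawlers from reading the category pages at all, so it hides the symptom rather than fixing whatever produces the bad links.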

Beats me. Does anyone have a solution for this, or has anyone run into the same problem ??

Peter

2
peterr
Re: MyLinks seems to send Googlebot into a spin ??
  • 2004/9/29 7:16

  • peterr

  • Just can't stay away

  • Posts: 518

  • Since: 2004/8/5 9


Hi,

Should I consider something like this hack ?

ShortURLs 0.3 for XOOPS

https://xoops.org/modules/mydownloads/visit.php?cid=28&lid=630

Peter


3
tl
Re: MyLinks seems to send Googlebot into a spin ??
  • 2004/9/29 13:43

  • tl

  • Friend of XOOPS

  • Posts: 999

  • Since: 2002/6/23


What is the crawler's user agent?

Is it something like the following?
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The usual Google user agent is
"Googlebot/2.1 (+http://www.google.com/bot.html)"

I suspect the first one is a faked Google crawler - I have been hit continuously for a couple of days now.

4
peterr
Re: MyLinks seems to send Googlebot into a spin ??
  • 2004/9/30 10:12

  • peterr

  • Just can't stay away

  • Posts: 518

  • Since: 2004/8/5 9


Hi,

Quote:

tl wrote:
What is the crawler's user agent?

Is it something like the following?
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The usual Google user agent is
"Googlebot/2.1 (+http://www.google.com/bot.html)"

I suspect the first one is a faked Google crawler - I have been hit continuously for a couple of days now.


Yes, thanks for pointing that out. If it is a faked crawler, the "real" Googlebot's user agent is:

"Googlebot/2.1 (+http://www.google.com/bot.html)"

whilst the fake one is:

"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

However, the IP addresses of the 'fake' ones resolve to the domain google.com, although no doubt they could spoof the IP address.
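One way to test whether an address really belongs to Google, independent of the user agent string, is a forward-confirmed reverse DNS lookup: resolve the IP to a hostname, check the hostname's domain, then resolve that hostname back and confirm it returns the same IP. The sketch below is a minimal version using Python's `socket` module. The list of trusted domain suffixes is an assumption based on the hostnames seen in this thread, and the function name `is_verified_crawler` is hypothetical.

```python
import socket

# Assumption: genuine Googlebot hosts reverse-resolve under these domains
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_crawler(ip,
                        reverse_lookup=socket.gethostbyaddr,
                        forward_lookup=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a crawler IP address."""
    try:
        hostname = reverse_lookup(ip)[0]        # PTR lookup: IP -> hostname
    except OSError:
        return False                            # no reverse record at all
    if not hostname.lower().endswith(TRUSTED_SUFFIXES):
        return False                            # hostname is in another domain
    try:
        addresses = forward_lookup(hostname)[2] # A lookup: hostname -> IPs
    except OSError:
        return False
    # The PTR record alone can be set by whoever controls the IP block,
    # so the hostname must also resolve back to the same address.
    return ip in addresses

if __name__ == "__main__":
    # Needs network access; the address is the one from the first post
    print(is_verified_crawler("66.249.65.200"))
```

The lookup functions are passed in as parameters only so the check can be exercised without network access; in normal use the defaults are enough.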

I'll find out more from searchengineworld and a few other places, and if 100% certain it is a fake bot, then I'll ban the agent.

There were no 404s from the _real_ Googlebot; they only came from the IP addresses of the other one.

Thanks,

Peter
