Hi,
Looking at the web server error messages today, I noticed quite a few 404s from one IP address. I checked it out, and it was 66.249.65.200, which resolved to crawl-66-249-65-200.googlebot.com.
Here is a small snippet of the error logs, in time order:
Quote:
/modules/mylinks/catalog.shtml
/404.shtml
/modules/mylinks/Sponsrps-mfga-colour04.doc
/404.shtml
/modules/mylinks/products.shtml
/404.shtml
/modules/mylinks/info-faq.shtml
/404.shtml
/modules/mylinks/pricing-au.shtml
/404.shtml
/modules/mylinks/order.shtml
/404.shtml
/modules/mylinks/contactus.shtml
None of these files exists _anywhere_ on my XOOPS website, but I recognised some of the filenames from other websites, so I decided to investigate the raw access logs. I don't know how the requests come to combine my domain name with a filename that can only be found by actually visiting a site linked from the MyLinks module.
Googlebot only gets the 404s in relation to MyLinks under certain circumstances, though. An earlier crawl returned all '200' (OK) responses, as follows:
Quote:
/modules/mylinks/ratelink.php?lid=3
/modules/mylinks/brokenlink.php?lid=1
/modules/mylinks/modlink.php?lid=2
/user.php?xoops_redirect=%2Fmodules%2Fmylinks%2Fmodlink.php%3Flid%3D2
/modules/mylinks/modlink.php?lid=3
/user.php?xoops_redirect=%2Fmodules%2Fmylinks%2Fmodlink.php%3Flid%3D3
/robots.txt
/modules/mylinks/modlink.php?lid=1
/user.php?xoops_redirect=%2Fmodules%2Fmylinks%2Fmodlink.php%3Flid%3D1
/modules/mylinks/singlelink.php?lid=7
/modules/mylinks/singlelink.php?lid=1
/modules/mylinks/viewcat.php?op=&cid=2
However, a later crawl by Googlebot returned all the 404s, and they seemed to occur only after it had accessed the file viewcat.php. Here is what happened:
Quote:
"GET /robots.txt HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=hitsA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=hitsA HTTP/1.1" 200
"GET /modules/mylinks/contactus.shtml HTTP/1.1" 404
"GET /modules/mylinks/order.shtml HTTP/1.1" 404
"GET /modules/mylinks/viewcat.php?cid=2&orderby=dateD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=titleA HTTP/1.1" 200
"GET /modules/mylinks/pricing-au.shtml HTTP/1.1" 404
"GET /modules/mylinks/viewcat.php?cid=2&orderby=ratingA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=titleA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=dateA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=ratingD HTTP/1.1" 200
"GET /modules/mylinks/info-faq.shtml HTTP/1.1" 404
"GET /modules/mylinks/products.shtml HTTP/1.1" 404
"GET /modules/mylinks/viewcat.php?cid=3&orderby=hitsD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?op=&cid=3 HTTP/1.1" 200
"GET /modules/mylinks/Sponsrps-mfga-colour04.doc HTTP/1.1" 404
"GET /modules/mylinks/catalog.shtml HTTP/1.1" 404
"GET /modules/mylinks/viewcat.php?cid=3&orderby=ratingA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=titleD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=titleD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=dateA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=dateD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=hitsD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=ratingD HTTP/1.1" 200
Can anyone explain what might be happening? If I click on one of the links, it resolves okay, but Googlebot seems to have followed a link such as:
http://www.jehoshua.net/modules/mylinks/visit.php?lid=1
and then still somehow thought that it was on the domain jehoshua.net, because some of the files on the site that is linked to are:
info-faq.shtml
products.shtml
etc., etc.
Is this a problem with Googlebot, or with the MyLinks module (I can't see how, because I tested the URLs just before the 404 messages and they were okay)? Or is it because I have _also_ got the complete URL to the websites in the MyLinks description?
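My best guess at the mechanism (and it is only a guess) is this: if the crawler keeps the visit.php address as the page's base URL when it is redirected to the external site, then any *relative* link on that external page would resolve against my domain. A quick sketch with Python's standard URL resolution shows what that would produce (the lid=1 URL is from my logs; "products.shtml" stands in for a relative link on the external site):

```python
from urllib.parse import urljoin

# Suspected base URL: the crawler may treat the redirecting visit.php
# address as the address of the destination page itself.
base = "http://www.jehoshua.net/modules/mylinks/visit.php?lid=1"

# A relative link as it might appear on the *external* linked site.
resolved = urljoin(base, "products.shtml")
print(resolved)
# -> http://www.jehoshua.net/modules/mylinks/products.shtml
```

That resolved URL matches the 404 paths in my logs exactly, which would explain why filenames from other people's sites show up under /modules/mylinks/ on mine.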
Msnbot didn't seem to have any trouble going through the /modules/mylinks path and following the links, although it didn't seem to 'dig' as deep as Google.
Should I 'Disallow' the file /modules/mylinks/viewcat.php in robots.txt, just as a temporary measure, in the hope that crawlers/spiders will follow the rules?
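If I do go that route, I assume the entry would look something like this (Disallow matches by path prefix, so this should also cover viewcat.php with any query string, e.g. ?cid=2&orderby=hitsA):

```
User-agent: *
Disallow: /modules/mylinks/viewcat.php
```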
Beats me. Has anyone found a solution for this, or run into the same problem?
Peter