Hi,
Looking at the web server error messages today, I noticed quite a few 404s from one IP address. I checked it out, and it was 66.249.65.200, which resolved to crawl-66-249-65-200.googlebot.com.
Here is a small snippet of the error logs, in time order:
Quote:
/modules/mylinks/catalog.shtml
/404.shtml
/modules/mylinks/Sponsrps-mfga-colour04.doc
/404.shtml
/modules/mylinks/products.shtml
/404.shtml
/modules/mylinks/info-faq.shtml
/404.shtml
/modules/mylinks/pricing-au.shtml
/404.shtml
/modules/mylinks/order.shtml
/404.shtml
/modules/mylinks/contactus.shtml
None of these files exists _anywhere_ on my XOOPS website, but I recognised some of the filenames from other websites, so I decided to investigate the raw access logs. I don't know how the requests come to combine my domain name with a filename that can only be found by actually visiting a site linked from the MyLinks module.
Googlebot only gets the 404s in relation to MyLinks under certain circumstances, though. An earlier crawl returned all '200' (OK) responses, as follows:
Quote:
/modules/mylinks/ratelink.php?lid=3
/modules/mylinks/brokenlink.php?lid=1
/modules/mylinks/modlink.php?lid=2
/user.php?xoops_redirect=%2Fmodules%2Fmylinks%2Fmodlink.php%3Flid%3D2
/modules/mylinks/modlink.php?lid=3
/user.php?xoops_redirect=%2Fmodules%2Fmylinks%2Fmodlink.php%3Flid%3D3
/robots.txt
/modules/mylinks/modlink.php?lid=1
/user.php?xoops_redirect=%2Fmodules%2Fmylinks%2Fmodlink.php%3Flid%3D1
/modules/mylinks/singlelink.php?lid=7
/modules/mylinks/singlelink.php?lid=1
/modules/mylinks/viewcat.php?op=&cid=2
However, a later crawl by Googlebot returned all the 404s, and they seemed to occur only after it had accessed the file viewcat.php. Here is what happened:
Quote:
"GET /robots.txt HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=hitsA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=hitsA HTTP/1.1" 200
"GET /modules/mylinks/contactus.shtml HTTP/1.1" 404
"GET /modules/mylinks/order.shtml HTTP/1.1" 404
"GET /modules/mylinks/viewcat.php?cid=2&orderby=dateD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=titleA HTTP/1.1" 200
"GET /modules/mylinks/pricing-au.shtml HTTP/1.1" 404
"GET /modules/mylinks/viewcat.php?cid=2&orderby=ratingA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=titleA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=dateA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=ratingD HTTP/1.1" 200
"GET /modules/mylinks/info-faq.shtml HTTP/1.1" 404
"GET /modules/mylinks/products.shtml HTTP/1.1" 404
"GET /modules/mylinks/viewcat.php?cid=3&orderby=hitsD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?op=&cid=3 HTTP/1.1" 200
"GET /modules/mylinks/Sponsrps-mfga-colour04.doc HTTP/1.1" 404
"GET /modules/mylinks/catalog.shtml HTTP/1.1" 404
"GET /modules/mylinks/viewcat.php?cid=3&orderby=ratingA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=titleD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=titleD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=dateA HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=dateD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=2&orderby=hitsD HTTP/1.1" 200
"GET /modules/mylinks/viewcat.php?cid=3&orderby=ratingD HTTP/1.1" 200
Can anyone explain what might be happening? If I click on one of the links, it resolves okay, but Googlebot seems to have followed a link such as:
http://www.jehoshua.net/modules/mylinks/visit.php?lid=1
and then still somehow thought that it was on the domain jehoshua.net, because some of the files on the site that is linked to are:
info-faq.shtml
products.shtml
etc., etc.
Is this a problem with Googlebot, or with the MyLinks module (I can't see how, because I tested the URLs just before the 404 messages and they were okay)? Or is it because I have _also_ got the complete URL to the websites in the MyLinks description?
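My best guess at the mechanism (and it is only a guess) is this: if the crawler keeps the visit.php address as the page's base URL when it is redirected to the external site, then any *relative* link on that external page would resolve against my domain. A quick sketch with Python's standard URL resolution shows what that would produce (the lid=1 URL is from my logs; "products.shtml" stands in for a relative link on the external site):

```python
from urllib.parse import urljoin

# Suspected base URL: the crawler may treat the redirecting visit.php
# address as the address of the destination page itself.
base = "http://www.jehoshua.net/modules/mylinks/visit.php?lid=1"

# A relative link as it might appear on the *external* linked site.
resolved = urljoin(base, "products.shtml")
print(resolved)
# -> http://www.jehoshua.net/modules/mylinks/products.shtml
```

That resolved URL matches the 404 paths in my logs exactly, which would explain why filenames from other people's sites show up under /modules/mylinks/ on mine.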
Msnbot didn't seem to have any trouble going through the /modules/mylinks path and following the links, although it didn't seem to 'dig' as deep as Google.
Should I 'Disallow' the file /modules/mylinks/viewcat.php in robots.txt, just as a temporary measure, in the hope that crawlers/spiders will follow the rules?
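If I do go that route, I assume the entry would look something like this (Disallow matches by path prefix, so this should also cover viewcat.php with any query string, e.g. ?cid=2&orderby=hitsA):

```
User-agent: *
Disallow: /modules/mylinks/viewcat.php
```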
Beats me. Has anyone found a solution for this, or run into the same problem?
Peter