I went to the Sphinx website. They have no search?
That's funny.
I didn't see a spider-like feature for indexing files in the docs for
Sphinx, so I went to Google advanced search and searched for the terms
spider AND index
I found this on a forum...
"I was wondering if I can use this to spider multiple websites. As nice as it is to create a search engine for my own information, I am creating a website and would like to search a couple of websites that aren't mine (it's not against their TOS) for data purposes.
I am sure that answer is around here somewhere but I thought I would ask just to be certain."
Answer:
"Sphinx does not include any Web spider. But you could of course use some 3rd party spider to fetch the documents and put them into database, and then index that database."
It looks blazing fast with huge amounts of data, but appears
dependent on a database and won't spider.
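To make the "3rd party spider, then index the database" idea concrete, here is a rough sketch of what such a spider could look like. This is only my assumption of the workflow, not anything from the Sphinx docs: the names `spider`, `PageParser`, `fetch_url` and the `documents` table are made up for illustration, and SQLite is used only to keep the example self-contained. A real setup would write to a database Sphinx can read as a source, such as MySQL.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects text nodes and outgoing links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Keeps every non-empty text node (a real spider would skip scripts/styles).
        if data.strip():
            self.text_parts.append(data.strip())


def fetch_url(url):
    """Default fetcher: download a page over HTTP and decode it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def spider(start_url, conn, fetch=fetch_url, max_pages=50):
    """Breadth-first crawl from start_url, storing page text in a table."""
    # Hypothetical schema: one row per page, ready for a separate indexer.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, url TEXT UNIQUE, body TEXT)"
    )
    queue, seen, stored = [start_url], {start_url}, 0
    while queue and stored < max_pages:
        url = queue.pop(0)
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        parser = PageParser()
        parser.feed(html)
        conn.execute(
            "INSERT OR REPLACE INTO documents (url, body) VALUES (?, ?)",
            (url, " ".join(parser.text_parts)),
        )
        stored += 1
        for href in parser.links:
            # Resolve relative links and drop #fragments before queueing.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    conn.commit()
    return stored
```

The `fetch` argument is there so the crawl can be tried against canned pages before pointing it at a real site. It ignores robots.txt and rate limiting, which anything crawling other people's sites would need to add.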
___________
From what I've read about Swish-e, it can search metadata and XML, keep multiple indexes, limit searches to document sets, and spider off-site.
I also found a benchmark of MySQL vs. Swish-e full-text
searching, at
http://joshr.com/src/docs/IndexingWithSwishe-Rabinowitz.pdf. I'm still looking for a Swish-e vs. Sphinx comparison.
Is the load from full-text query searching something to be concerned
about? I tend to look for the least query-intensive alternative.
Not sure if it's a valid concern. Any thoughts?
And a question: is anyone familiar with Tesseract OCR, Google's
open-source OCR?