Can I import data from a yahoo group?

Requested and Answered by Carnuke on 2004/12/25 14:52:48

Taken from http://www.tt-solutions.com/en/products/yahoo2mbox/

What is it?

yahoo2mbox is a small Perl script which retrieves all messages from a mailing list archive at Yahoo! Groups (there is probably a missing copyright sign somewhere here) and stores them in a local file in MBOX format, which is recognized by all Unix mail readers and a good many others.

If you don't know what Yahoo! Groups are, you probably don't need this program. But if you want to search through an existing archive using your favourite MUA instead of the Yahoo interface, as I did, you might like it.
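Because the output is a plain MBOX file, it can also be read by scripts, not only by mail readers. The following is a minimal sketch, not part of yahoo2mbox, that uses the mailbox module from the Python standard library to list messages whose subject contains a given word. The function name find_by_subject and the file name mygroup are hypothetical.

```python
import mailbox


def find_by_subject(mbox_path, needle):
    """Return (index, subject) pairs for messages whose Subject contains needle.

    The comparison is case-insensitive. mbox_path is assumed to be a file
    produced by yahoo2mbox (or any other MBOX file).
    """
    needle = needle.lower()
    matches = []
    # create=False: fail instead of silently creating an empty mailbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, message in enumerate(box):
            subject = str(message.get("Subject", "") or "")
            if needle in subject.lower():
                matches.append((index, subject))
    finally:
        box.close()
    return matches


# Example call (hypothetical file name):
#     for index, subject in find_by_subject("mygroup", "install"):
#         print(index, subject)
```

The index returned here is only the position of the message in the local file, not the message number in the Yahoo archive.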

The latest version is 0.17 and adds a new --last command line option as well as some other minor fixes.

Other interesting features include support for localized and age-restricted Yahoo groups. Unfortunately, automatic address unmangling doesn't work any more (as of December 2003 and probably before) because of a change in Yahoo's address presentation algorithm.
License

It is in the public domain; you can do whatever you want with it. On the other hand, you didn't expect any guarantees for it anyhow, did you? Just in case, there are none, and considering that the initial version of this script was written in 15 minutes, it surely does contain bugs -- use at your own risk!
Requirements

You need Perl 5.004 (it might work with earlier versions, but this is the earliest one I tested it with) and a bunch of modules, all of which can be retrieved from CPAN, including (but possibly not limited to) HTTP::Cookies, LWP::UserAgent and HTML::Parser.

The program has only been tested under Linux with Perl 5.004, 5.005, 5.6 and 5.8 and under Windows 2000/XP/2003 with ActivePerl builds from 631 to 810, but it should work on the other platforms supported by Perl as well. In particular, it is neither Unix- nor Windows-specific.
Download

You can get version 0.17 of the script here (sizes are approximate):

* compressed with gzip (Unix) (19Kb)
* compressed with zip (Windows) (19Kb)

Usage instructions

Windows users: if you have never used Perl before, you need to download Perl from, for example, ActiveState and install it. To run this script you should enter perl yahoo2mbox.pl in a command line ("DOS") window and enter all the other parameters afterwards.

Simply run the script, giving it the name of the group to download the messages from. If the group archives are restricted to members only, you will need to use the --user=member_name option and, optionally, --pass=password. Strictly speaking, only the first one is required: you will be prompted for the password if you haven't specified it.

By default the output goes to a local file with the same name as the group, but this can be changed using the -o output_file option. You can control the range of messages to be retrieved (all by default) using the --start and --end options.

If the output file already exists, messages are appended to it unless the --noresume option is given. By default, resuming starts with the message whose index equals the number of messages already present in the file, but this is affected by the --start option. For example, if you started downloading from message 100 and the process aborted after 10 messages, the next run would resume at message 11 without any special options, and at message 111 (as is probably needed) only if you specify the same --start=100 option the next time as well.
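To make the options above concrete, here is a minimal Python sketch that only assembles a command line from them; it does not run anything. The helper name build_command and the group name mygroup are hypothetical. Two things are assumptions not confirmed by the text above: that --end takes its value in the same --end=N form as --start=100, and that options may be placed before the group name.

```python
def build_command(group, user=None, password=None, output=None,
                  start=None, end=None, noresume=False):
    """Assemble an argument list for running yahoo2mbox.pl.

    Only options described in the usage instructions are used. The
    --end=N form and the position of the group name (last) are
    assumptions made for this sketch.
    """
    cmd = ["perl", "yahoo2mbox.pl"]
    if user is not None:
        cmd.append(f"--user={user}")
    if password is not None:
        # If omitted, the script is said to prompt for the password.
        cmd.append(f"--pass={password}")
    if output is not None:
        cmd.extend(["-o", output])
    if start is not None:
        cmd.append(f"--start={start}")
    if end is not None:
        cmd.append(f"--end={end}")
    if noresume:
        cmd.append("--noresume")
    cmd.append(group)
    return cmd


# Resuming a download that was originally started at message 100:
print(" ".join(build_command("mygroup", user="member_name", start=100)))
```

Passing the same start value on every run mirrors the advice above about repeating --start=100 when resuming.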

The other useful options are --proxy=url, if you're behind a firewall (you may also include the user name and password if your proxy needs them, using the http://user:pass@host.name notation), and --cookies, if you have previously logged in to Yahoo using Netscape or yahoo2mbox (it avoids the need to specify the login name and password each time).

If you want to access your country-specific groups you should use the --country option. Please note that only a few countries are currently supported and your help is needed to make this option work for more of them!

The last noteworthy feature is the --x-yahoo option, which tells the script to insert an X-Yahoo-Message-Num header, containing the ordinal number of the message in the group, into all downloaded messages. This may be useful to synchronize between the local mailbox and the Yahoo archives, for example.
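As an illustration of that kind of synchronization, the following Python sketch (not part of yahoo2mbox) reads a mailbox downloaded with --x-yahoo and reports which message numbers are absent from the local file. It assumes the header value is a plain decimal number; the function names and the file name mygroup are hypothetical.

```python
import mailbox


def downloaded_numbers(mbox_path):
    """Collect the X-Yahoo-Message-Num values found in a local MBOX file."""
    numbers = set()
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            value = message.get("X-Yahoo-Message-Num")
            if value is None:
                continue  # message was saved without --x-yahoo
            text = str(value).strip()
            if text.isdigit():  # assumption: the value is a decimal integer
                numbers.add(int(text))
    finally:
        box.close()
    return numbers


def missing_numbers(numbers):
    """Return the gaps between the lowest and highest number seen."""
    if not numbers:
        return []
    return [n for n in range(min(numbers), max(numbers) + 1)
            if n not in numbers]


# Example call (hypothetical file name):
#     print(missing_numbers(downloaded_numbers("mygroup")))
```

A gap does not necessarily mean a failed download: the message may simply no longer exist in the group archive.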
Problems

The most common problem seems to be related to the existence of some kind of download limit put in place by Yahoo. The older versions of the script (before 0.13) used to be very confused by the error page served by Yahoo after a certain number of bytes (apparently it's counted in bytes and not messages) had been downloaded. The newer ones should detect it automatically and stop trying to download anything (what's the point of banging your head against the wall, anyhow) after giving a corresponding error message.

The download limit disappears with time, but unfortunately I don't know how long you have to wait before it does. The only hint I can give is that there are two, apparently independent, download limits: one for anonymous users and another for registered ones. So you could try downloading the messages anonymously and, when you hit the limit, switch to using the username and password. Of course, this works only for groups with public archives.

Additionally, the limit is IP address-specific, so if you are able to change your IP address (e.g. you have a dial-up connection) you could try doing this. On the other hand, if you have a direct and fast connection to the internet, using the --delay option could be helpful, as it seems to bypass at least some of the download limits.

There is a known bug in the handling of multipart messages with internal parts of type message/rfc822 (i.e. embedded messages). Yahoo incorrectly wraps the internal headers, in particular the Content-XXX ones, as can be seen in this example. Unfortunately, there is no real workaround. If you encounter this problem, manually insert some spaces at the beginning of the lines following the Content-XXX header.
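The manual fix can be approximated with a short script. The sketch below is a heuristic of my own, not something yahoo2mbox does: given the lines of one header block, it treats any non-empty line that follows a Content-XXX header and does not itself look like a "Name:" header as a wrapped continuation, and prefixes it with a space. The function name is hypothetical, and you still have to locate the affected header block yourself.

```python
import re

# A header line starts with a field name followed by a colon.
HEADER_START = re.compile(r"^[A-Za-z0-9-]+:")


def indent_wrapped_content_headers(header_lines):
    """Indent unindented continuation lines that follow a Content-* header."""
    fixed = []
    in_content_header = False
    for line in header_lines:
        if line[:1] in (" ", "\t"):
            # Already a proper continuation line: keep it unchanged.
            fixed.append(line)
        elif HEADER_START.match(line):
            # A new header begins; remember whether it is a Content-* one.
            in_content_header = line.lower().startswith("content-")
            fixed.append(line)
        elif line.strip() and in_content_header:
            # Wrapped without leading whitespace: restore the indentation.
            fixed.append(" " + line)
        else:
            # Blank line (end of headers) or unrelated text: leave as is.
            in_content_header = False
            fixed.append(line)
    return fixed


broken = ["Content-Type: text/plain;", 'charset="us-ascii"', ""]
print(indent_wrapped_content_headers(broken))
```

Work on a copy of the mailbox and check the result by eye, since a heuristic like this can misjudge unusual headers.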
Thanks

* To Malcolm-Rannirl for implementing support for using the Netscape cookies file and more.
* To Dan Libby for the idea of the --resume option
* To Per Bolmstedt for the old semi-manual address unmangling code
* To Daniel Roethlisberger for country support code
* To JHB for support of age-restricted Yahoo! groups
* To Zainul M Charbiwala for implementing automatic address unmangling (unfortunately this doesn't work any longer but it was incredibly useful while it did)
* To Robin Lee Powell for bug reports
* To Daniel Sutcliffe for various contributions

If you think your name should be in this list and it is not, please contact me.
Change log

0.17
Detect per group download limit and give better error messages for it (Paul Telford)
Added --last option (Joshua Ellis)
0.16
Added support for --country=tw (Henry H. Tan-Tenn)
Added --delay option (Malcolm Heath)
Fixed download limit detection after Yahoo change (Bill)
0.15
Fixed the script to work with new (August 2003) Yahoo page layout
Updated address unmangling table (Robert Zierer)
Detection of endless redirect loops added.
0.14
Fixed skipping advertisement pages after recent Yahoo changes
0.13
Automatic address decoding (Zainul M Charbiwala)
--x-yahoo option (David Jaquay)
Fixed bug in handling --start option and give more precise error messages for common error situations
0.12
Support for age-restricted groups; support for ar and mx values of the --country option (JHB)
0.11
Fixed access to classified groups (Daniel Sutcliffe) and warnings when running under Perl 5.8.
0.10
Fixed another embarrassing bug with --country option.
0.09
Fixed fatal bug in previous release (didn't work at all without --country switch)
0.08
Countries support added
0.07
Address unmangling added
0.06
Proxy support, fix the produced MBOX files on the fly (handle header wrapping and From stuffing ourselves as eGroups doesn't do it properly any more)
0.05
Support for using the cookies files, Windows fixes
0.04
Support for accessing the member-only archives
0.03
Updated for eGroups site layout changes
0.02
Added --start and --end options to allow retrieving just some messages (useful for resuming).
0.01
Initial release

This Q&A was found on XOOPS Web Application System : https://xoops.org/modules/smartfaq/faq.php?faqid=330