Fan Fiction Fetcher, now with added LWP
Sep. 10th, 2011 11:07 am
For those of you who were interested in my fanfic-fetching perl script, I've just released version 0.16 of WWW-FetchStory (http://search.cpan.org/~rubykat/WWW-FetchStory-0.16/) (well, that will be the URL when CPAN finishes processing it).
The big news: it no longer depends on wget! It uses the LWP perl module instead. This means that MS-Windows users should be able to use the script (fingers crossed).
I have retained the option to use wget, because some sites work with wget that don't work with LWP.(*)
There are a bunch of other improvements, and another new fetcher (Project Gutenberg), but the LWP stuff is the important bit.
(*) I have spent HOURS trying to get LWP + Cookies to work with LiveJournal, but no joy, and I have given up. LWP and Cookies work with other sites (I tried it on Ashwinder) but not with LJ. (throws hands in air) Anyone who can figure out why the cookies sometimes work and sometimes don't, that would be great. I have pored over debugging output, I have made observations with wireshark... The only difference seems to be that wget sends the right cookies and LWP only sends some of the right cookies.
no subject
Date: 2011-09-10 01:27 am (UTC)
no subject
Date: 2011-09-10 01:48 am (UTC)
I have just now tested the script to fetch a locked post from Dreamwidth, and the same locked post made by the same person (for whom I have the same level of access on their DW and their LJ journals), both using the same exported cookie file which contains the appropriate session cookies... and it succeeds for Dreamwidth, but fails for LiveJournal.
Fails in the sense that Dreamwidth gives me the page I want, and LiveJournal gives me a "please log in" page.
8-(
no subject
Date: 2011-09-10 02:27 am (UTC)
OH! Also, have you viewed the LJ journal with the browser you exported the cookies from? There's a per-journal cookie that needs to be set (unless that's what you meant by per-session cookie).
no subject
Date: 2011-09-10 02:32 am (UTC)
I don't think so. Or, I should say, it fails on both non-locked posts with adult content, and on locked posts without adult content.
Also, have you viewed the LJ journal with the browser you exported the cookies from? There's a per-journal cookie that needs to be set (unless that's what you meant by per-session cookie).
Yes, I have, and yes it is.
no subject
Date: 2011-09-10 07:10 pm (UTC)
Ugh, yeah I'm stumped :-(
no subject
Date: 2011-09-10 11:17 pm (UTC)
The sequence of events is this:
1) Load cookies from the cookie file. Both LWP and wget read all the cookies correctly.
2) Send a GET request for the URL.
This is the GET request sent by wget:
------------------
GET /571792.html?format=light HTTP/1.0^M
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: tptigger.livejournal.com
Connection: Keep-Alive
Cookie: ljdomsess.tptigger=v1:u1359334:s1658:t1315566000:g4eff918a17f6712f1c4b55b3ee9e0102f99b6372//1; BMLschemepref=vertigo; __qca=P0-1371400743-1309003285157; __unam=37d207d-12e74531b06-113e375f-1000; __utma=48425145.223224461.1309003285.1309003285.1309003285.1; __utmc=48425145; __utmz=48425145.1309003285.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); cart=; langpref=en_LJ/1314404400; ljloggedin=v1:u1359334:s1658:t1311802976:ga4e1c230bbbe55184ea4ac827346291cac618e59; ljsession=v1:u1359334:s1658:t1314403200:g3511fa97d76eb492f381533c9bc3f387de3f3059//1; ljuniq=ufTCu246eDSh5R5%3A1310599401%3Apgstats0; mlm-alert-read=; mlm-alert-start=; mlm-dialog-ary=; s_cc=true; s_sq=sixapartlivejournal%3D%2526pid%253Dhttps%25253A//www.livejournal.com/login.bml%2526oid%253DLog%252520in...%2526oidt%253D3%2526ot%253DSUBMIT; show_sponsored_styles=1; show_sponsored_vgifts=0; vertical_tags=1313106597%7Cvgam%3A6,vtec%3A6
------------------
This is the GET request sent by LWP:
------------------
GET http://tptigger.livejournal.com/571792.html?format=light
Connection: keep-alive
Accept-Encoding: gzip, x-gzip, deflate, x-bzip2
User-Agent: libwww-perl/5.837
Cookie: mlm-dialog-ary=; BMLschemepref=vertigo; ljloggedin=v1:u1359334:s1658:t1311802976:ga4e1c230bbbe55184ea4ac827346291cac618e59; __utmc=48425145; vertical_tags=1313106597%7Cvgam%3A6,vtec%3A6; mlm-alert-read=; s_cc=true; __utmz=48425145.1309003285.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); langpref=en_LJ/1314404400; s_sq=sixapartlivejournal%3D%2526pid%253Dhttps%25253A//www.livejournal.com/login.bml%2526oid%253DLog%252520in...%2526oidt%253D3%2526ot%253DSUBMIT; __unam=37d207d-12e74531b06-113e375f-1000; show_sponsored_styles=1; __utma=48425145.223224461.1309003285.1309003285.1309003285.1; ljuniq=ufTCu246eDSh5R5%3A1310599401%3Apgstats0; __qca=P0-1371400743-1309003285157; cart=; mlm-alert-start=; show_sponsored_vgifts=0; ljsession=v1:u1359334:s1658:t1314403200:g3511fa97d76eb492f381533c9bc3f387de3f3059//1
Keep-Alive: 300
------------------
I don't know why wget splits the URL into "Host" and the rest. I don't know if that is significant or not.
What I do think is significant is that wget sends ONE cookie that LWP doesn't send: ljdomsess.tptigger
Which I think is the very cookie you were talking about.
I do know that LWP loads that cookie; I've checked. It just doesn't send it, and I have no idea as to why. All the documentation about LWP and cookies says that you give LWP a cookie-jar and stand back and let LWP take care of it. But it isn't taking care of it. 8-(
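A quick way to confirm which cookies differ between the two requests is to compare the cookie names in each `Cookie:` header. The following is a minimal Python sketch of that comparison, not part of the fetcher itself; the function name `cookie_names` is made up for illustration, and the header strings are abbreviated stand-ins for the dumps above, with the values replaced by a placeholder.

```python
def cookie_names(header_value):
    """Return the set of cookie names found in a Cookie: header value."""
    names = set()
    for pair in header_value.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Everything before the first "=" is the cookie name.
        name, _, _value = pair.partition("=")
        names.add(name)
    return names


# Abbreviated stand-ins for the two headers quoted above.
# The names are real; the values are placeholders ("x").
wget_cookie = "ljdomsess.tptigger=x; BMLschemepref=x; cart=; ljloggedin=x; ljsession=x"
lwp_cookie = "BMLschemepref=x; ljloggedin=x; cart=; ljsession=x"

missing_from_lwp = cookie_names(wget_cookie) - cookie_names(lwp_cookie)
print(sorted(missing_from_lwp))
```

Pasting the full header values from a capture into the two strings gives the complete list of cookies that one client sends and the other does not, regardless of the order in which they appear.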
no subject
Date: 2011-09-13 05:06 am (UTC)
That makes me wonder whether LWP is connecting through a proxy: the only case I've seen where GET uses a full URL is if you're telling a proxy, rather than the final server, which URL to get (the direct case uses only the partial URL on the GET line and the hostname in a separate Host: header - at least with HTTP 1.1).
...wait, LWP is using HTTP 0.9? That seems wrong, too. I would have expected "HTTP/1.0" or (preferably) "HTTP/1.1" at the end of the "GET" line.
Is this debugging output from LWP or is it the "real thing" (captured from the network somehow, for example)? Perhaps it's just LWP trying to be helpful in its debugging output and not telling you the exact string it's sending?
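To make the two request forms concrete, here is a small Python sketch (an illustration only, with the hypothetical helper names `origin_form` and `absolute_form`) that builds the request line and Host: header for the same URL, once as a client would send it directly to the server and once as it would send it to a proxy.

```python
from urllib.parse import urlsplit


def origin_form(url):
    """Request line and Host header as sent directly to the origin server."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return ["GET %s HTTP/1.1" % target, "Host: %s" % parts.netloc]


def absolute_form(url):
    """Request line as sent to a proxy: the full URL is the request target."""
    parts = urlsplit(url)
    return ["GET %s HTTP/1.1" % url, "Host: %s" % parts.netloc]


url = "http://tptigger.livejournal.com/571792.html?format=light"
for line in origin_form(url) + absolute_form(url):
    print(line)
```

The wget dump matches the first form. The LWP dump resembles the second but lacks the protocol version, which is why it may be a debugging rendering rather than the bytes actually sent.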
no subject
Date: 2011-09-13 11:27 am (UTC)
Is this debugging output from LWP or is it the "real thing" (captured from the network somehow, for example)? Perhaps it's just LWP trying to be helpful in its debugging output and not telling you the exact string it's sending?
Oh Lord, I hope not!
no subject
Date: 2011-09-10 11:29 pm (UTC)
Ah. An interesting difference between the cookies themselves!
Set-Cookie3: ljdomsess.tptigger=v1%3Au51744%3As116%3At1315522800%3Agae77158fdd7a4af71af2431af4d34159ee085152%2F%2F; path="/"; domain=tptigger.dreamwidth.org; expires="2012-01-04 17:08:18Z"; version=0
Set-Cookie3: ljdomsess.tptigger="v1:u1359334:s1658:t1315566000:g4eff918a17f6712f1c4b55b3ee9e0102f99b6372//1"; path="/"; domain=.tptigger.livejournal.com; expires="2012-01-04 17:08:18Z"; version=0
Note that the LiveJournal domain is stored as ".tptigger.livejournal.com", while the Dreamwidth domain is stored as "tptigger.dreamwidth.org" -- is this the vital difference?
Would that be enough to prevent LWP matching the domain, and thus assuming that that cookie is not associated with that URL?
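The hypothesis can be modelled in a few lines. The Python sketch below is not LWP's actual matching code (which I have not reproduced here); it only contrasts an exact string comparison, which would behave the way the symptoms suggest, with a more tolerant match that ignores leading dots, as RFC 6265 specifies. Both function names are invented for the illustration.

```python
def naive_match(request_host, cookie_domain):
    """Exact string comparison: a leading dot makes the domains differ."""
    return request_host.lower() == cookie_domain.lower()


def tolerant_match(request_host, cookie_domain):
    """Ignore leading dots, then accept the host itself or any subdomain."""
    host = request_host.lower()
    domain = cookie_domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


cases = [
    ("tptigger.dreamwidth.org", "tptigger.dreamwidth.org"),
    ("tptigger.livejournal.com", ".tptigger.livejournal.com"),
]
for host, domain in cases:
    print(host, domain, naive_match(host, domain), tolerant_match(host, domain))
```

Under the naive comparison the Dreamwidth cookie matches and the LiveJournal one does not, which is the same pattern as the observed behaviour. That is consistent with the leading dot being the culprit, but it does not prove it.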
no subject
Date: 2011-09-11 02:47 pm (UTC)
Any way to test?
no subject
Date: 2011-09-11 06:21 pm (UTC)
\o/
Pity it's a hack.
8-(
no subject
Date: 2011-09-12 09:32 am (UTC)
no subject
Date: 2011-09-13 05:07 am (UTC)
That does seem off to me - as if the dot before "tptigger" is required in a URL for the cookie to match, and there is no dot before it in http://tptigger.livejournal.com .
no subject
Date: 2011-09-13 11:25 am (UTC)
no subject
Date: 2011-09-10 03:29 pm (UTC)
Perhaps they have done something that specifically blocks the LWP user agent. If that's the case, you could see if changing the user agent string helps. Good luck!
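For anyone wanting to try the user-agent idea outside the script, this is a minimal Python sketch of setting a custom User-Agent on a request. It only builds the request object and does not touch the network; the helper name `build_request` and the agent string are made up for the example. In Perl the equivalent knob is the `agent` attribute of `LWP::UserAgent`.

```python
from urllib.request import Request


def build_request(url, user_agent):
    """Build a GET request that announces the given User-Agent string."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request(
    "http://tptigger.livejournal.com/571792.html?format=light",
    "Mozilla/5.0 (hypothetical test agent)",
)
# urllib normalises header names, so the key is stored as "User-agent".
print(req.get_header("User-agent"))
```

If the site responds differently to this request than to one carrying the default agent string, that points to user-agent filtering rather than a cookie problem.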
no subject
Date: 2011-09-10 11:19 pm (UTC)
no subject
Date: 2012-03-20 04:45 pm (UTC)
Which is super, as my fluency with perl is about on the level "Oh shit, hashes contain *references* to arrays? You can do that?"
no subject
Date: 2012-03-20 09:12 pm (UTC)