On Fri, 5 Feb 2010 22:05:27 +0100 Gisle Aas <gisle@aas.no> wrote:
> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
> specifies how to pre-scan an HTML document to sniff the charset.
> Would it not be simpler to just implement the algorithm as specified
> instead of using a generic parser? The use of HTML::Parser to
> implement this sniffing was just me trying a shortcut since
> HTML::Parser seemed to implement a superset of these rules.
Those rules look somewhat involved to me, especially knowing that we already have both XS and Pure Perl parsers at hand.
Two thoughts:
1. What about using HTML::Encoding, after adapting it so that it has only a conditional dependency on HTML::Parser and only uses HTML::Parser if it is available? (It already tries several detection methods before getting to HTML::Parser):
http://search.cpan.org/~bjoern/HTML-Encoding/
A variation on this idea would be for *it* to fall back to a pure Perl HTML parser instead of skipping the HTML parsing check completely.
2. I note this from the spec page you reference:
"This algorithm is a willful violation of the HTTP specification, which requires that the encoding be assumed to be ISO-8859-1 in the absence of a character encoding declaration to the contrary, and of RFC 2046, which requires that the encoding be assumed to be US-ASCII in the absence of a character encoding declaration to the contrary. This specification's third approach is motivated by a desire to be maximally compatible with legacy content. [HTTP] [RFC2046]"
According to this, we can skip all this encoding-detection work and still be HTTP spec compliant (although it might be more user-friendly to keep trying to guess).
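The conditional dependency from the first idea could be sketched roughly as follows. This is only an illustration, not code from HTML::Encoding: the helper names `have_module` and `pick_parser_backend` are made up, and `My::PurePerl::Parser` is a placeholder for whatever pure Perl fallback would be used.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the named module can be loaded.
# A missing module is reported as false instead of dying.
sub have_module {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

# Hypothetical helper: return the first candidate that loads,
# or undef if none of them is available.
sub pick_parser_backend {
    my @candidates = @_;
    for my $module (@candidates) {
        return $module if have_module($module);
    }
    return;
}

# Prefer the XS parser, fall back to a (placeholder) pure Perl one.
my $backend = pick_parser_backend('HTML::Parser', 'My::PurePerl::Parser');
print defined $backend ? "using $backend\n" : "no HTML parser available\n";
```

With this shape, the HTML parsing step is simply skipped (or replaced) when no backend loads, so the XS module stays optional.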
http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding specifies how to pre-scan an HTML document to sniff the charset. Would it not be simpler to just implement the algorithm as specified instead of using a generic parser? The use of HTML::Parser to implement this sniffing was just me trying a shortcut since HTML::Parser seemed to implement a superset of these rules.
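To give a feel for what a parser-free pre-scan involves, here is a deliberately simplified sketch. It is not the HTML5 algorithm: it does not skip comments or handle every attribute form, the 1024-byte window is an assumption, and `sniff_meta_charset` is a made-up name.

```perl
use strict;
use warnings;

# Simplified sketch: look for a charset declaration in <meta> tags near
# the start of the document. Handles both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
sub sniff_meta_charset {
    my ($bytes) = @_;
    my $head = substr($bytes, 0, 1024);    # assumed scan window
    while ($head =~ /<meta\b([^>]*)>/gi) {
        my $attrs = $1;
        if ($attrs =~ /\bcharset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)/i) {
            return lc $1;                  # normalise to lower case
        }
    }
    return;                                # no declaration found
}

my $html = '<html><head><meta http-equiv="Content-Type" '
         . 'content="text/html; charset=ISO-8859-1"></head>';
my $charset = sniff_meta_charset($html);
print "charset: ", (defined $charset ? $charset : "(none)"), "\n";
```

A real implementation would follow the spec's byte-level steps; the point here is only that the job is small enough not to need a general HTML parser.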
I now have working code published which allows HTTP::Message to work without the dependency on HTML::Parser. This is useful because it's a step towards splitting out some of the HTTP modules into their own distribution which does not have this dependency, which in turn depends on a C compiler. So, this project could help allow parts of LWP to be used in places where a C compiler is not available, or when it would be more convenient to distribute one code line that could be used directly on multiple architectures. (But this is not the only use of HTML::Parser by the distribution. LWP::UserAgent makes use of HTML::HeadParser, which in turn uses HTML::Parser.)
The solution passes the numerous existing tests for charset detection, as well as a new one I added.
However, I'm not yet recommending that the work be merged because the approach is not clean.
Essentially I have embedded a fairly full-featured Pure Perl HTML parser into HTTP::Message. :) The code was taken from my fork of the "HTML::Parser::Simple" project and specialized somewhat for this case:
http://github.com/markstos/html--parser--simple
I think a cleaner approach would be to publish this Pure Perl HTML parser, and then have an option to use it if HTML::Parser is not available.
A little history about HTML::Parser::Simple:
Ron Savage created the project based on the htmlparser.js JavaScript parser by John Resig. This branch was not trying to be particularly compatible with anything. It defines a new API. In particular, it bundles a parse tree *consumer*, Tree::Simple, as well as a parse tree producer.
I forked the project and made some incompatible changes to pursue a different goal: Create a pure Perl HTML parser that is compatible with the HTML::Parser API. Or specifically, I wanted to emulate the HTML::Parser 2.x API enough so that my parser could be used in place of it with HTML::FillInForm. My work met that goal-- it can be used to pass all HTML::FillInForm tests with some minor failures that I don't think matter.
This new case of parsing meta tags is another specialized use of the parser that gives me another reason to publish the work.
Here's the problem: While I care about these specific goals for an HTML::Parser that is "compatible enough", I'm not really interested in personally pursuing the idea of a Pure Perl HTML::Parser that is 100% compatible with HTML::Parser just for the sake of it. In short, I'm sure there will be change requests beyond the uses I care about, and I'm not interested in maintaining the module to extend it for other uses.
There's also the matter of what to name it, since a version of HTML::Parser::Simple already exists. There's always HTML::Parser::PP or HTML::Parser::PurePerl, but those names just invite the idea that the goal is to be 100% compatible with HTML::Parser.
I'll discuss the matter further with Ron Savage to get his thoughts.
Feedback on the topic from other LWP users is welcome.
> > I took an interest in the issue of HTTP header ordering and researched
> > what several other Perl modules do in regards to this as well as Ruby's
> > Rack. I published the result on my blog:
> >
> > http://mark.stosberg.com/blog/2010/01/generating-http-headers-sorted-or-unsorted.html
> >
> > The summary is that I support the option for unsorted headers in
> > HTTP::Headers. Michael Greb made a good case for it, and the
> > possibility for a performance improvement is attractive too.
>
> I would prefer if there was a way to make the sorted headers as fast
> as unsorted headers :-)
I have an idea which might work for this, which is different from the approaches used in HTTP::Headers::Fast. I can try some experiments privately and report back if it turns out to be a workable approach.
> Instead of introducing the 'as_string_without_sort' method could we
> achieve the same effect with a 'order' argument to 'as_string'? Could
> take values like 'sorted'/'original'/'dontcare'.
I think that would work equally well, and also allows for backwards compatibility.
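As a rough illustration of the proposed 'order' argument, here is a standalone sketch. It is not HTTP::Headers code: `headers_as_string` is a made-up function working on a plain list of [name, value] pairs, and the alphabetical sort is only a stand-in for whatever ordering the real 'sorted' mode applies.

```perl
use strict;
use warnings;

# Sketch of an as_string-like function with an 'order' option.
# $fields is an array ref of [name, value] pairs in original order.
sub headers_as_string {
    my ($fields, %opt) = @_;
    my $order = defined $opt{order} ? $opt{order} : 'sorted';  # old behaviour stays the default
    my $eol   = defined $opt{eol}   ? $opt{eol}   : "\n";

    my @pairs = @$fields;
    if ($order eq 'sorted') {
        # Stand-in ordering: case-insensitive alphabetical by field name.
        @pairs = sort { lc($a->[0]) cmp lc($b->[0]) } @pairs;
    }
    elsif ($order ne 'original' && $order ne 'dontcare') {
        die "unknown order '$order'\n";
    }
    # 'original' and 'dontcare' both skip the sort in this sketch.
    return join '', map { "$_->[0]: $_->[1]$eol" } @pairs;
}

my $demo_fields = [
    [ 'Server'       => 'Fool/1.0' ],
    [ 'Content-Type' => 'text/plain' ],
];
print headers_as_string($demo_fields, order => 'original');
```

Because 'sorted' remains the default, existing callers that pass no argument would see no change, which is the backwards compatibility mentioned above.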
On Tue, Jan 26, 2010 at 17:34, Mark Stosberg <mark@summersault.com> wrote:
>
> In 2008 there was some discussion about an option to preserve the
> ordering of HTTP headers. Part of that thread is quoted below.
>
> The idea resurfaced in another form with the release of
> HTTP::Headers::Fast, which provided a method to get back the
> headers unsorted. However, the motivation was different there--
> performance-- and the implementation was different as well. It returns
> headers in essentially random order instead of the order in which
> they were created or transmitted.
>
> I took an interest in the issue of HTTP header ordering and researched
> what several other Perl modules do in regards to this as well as Ruby's
> Rack. I published the result on my blog:
>
> http://mark.stosberg.com/blog/2010/01/generating-http-headers-sorted-or-unsorted.html
>
> The summary is that I support the option for unsorted headers in
> HTTP::Headers. Michael Greb made a good case for it, and the
> possibility for a performance improvement is attractive too.
I would prefer if there was a way to make the sorted headers as fast as unsorted headers :-)
I still would like to see support for the ordering of headers preserved at some point.
Instead of introducing the 'as_string_without_sort' method could we achieve the same effect with a 'order' argument to 'as_string'? Could take values like 'sorted'/'original'/'dontcare'.
--Gisle
> On Sun, 7 Sep 2008 15:53:46 +0200
> "Gisle Aas" <gisle@aas.no> wrote:
>
>> On Sun, Sep 7, 2008 at 1:49 PM, Michael Greb <mgreb@linode.com> wrote:
>> > On Sep 5, 2008, at 7:23 PM, Gisle Aas wrote:
>> >>
>> >> True; and in this case we need to define what happens when fields are
>> >> modified with 'push', 'set' or 'init' and 'remove' as that's the API
>> >> that modifies stuff. Let me suggest the following definition of the
>> >> behaviour:
>> >>
>> >> - 'push' always appends the field at the end of all headers. Multiple
>> >>   occurrences of a field name do not have to be consecutive.
>> >>
>> >> - 'init' either does nothing or it works like 'push'.
>> >>
>> >> - 'remove' will always remove all occurrences of a field.
>> >>
>> >> - 'set' will work like 'push' if no other occurrence of the field exists.
>> >>
>> >> - 'set' will update the first occurrence if the field exists (and
>> >>   remove all other occurrences). If multiple field values are provided
>> >>   with 'set' they are basically all injected at the location of the
>> >>   first existing value.
>> >
>> >
>> > On Sep 6, 2008 at 2:57 AM, Gisle Aas wrote:
>> >>
>> >> I think it makes sense to be able to enable them separately.
>> >> Suggested interface:
>> >>
>> >> $h->scan(\&cb, original_order => 1, original_case => 1);
>> >> $h->as_string(eol => "\n", original_order => 1, original_case => 1);
>> >
>> > The attached patch uses the interface above and works towards the behavior
>> > outlined in the first message. Due to the headers being stored as a hash,
>> > pushing does not currently preserve previous values; second and subsequent
>> > pushes of the same header will overwrite the previous value. Supporting
>> > this would require a change in how the headers are stored within the module.
>> > Your thoughts?
>>
>> I think it's better to just use your original approach and just keep
>> the representation as it used to be, with the addition of an array that
>> records the original field names and their order. This should lead to
>> a smaller patch as the only thing that needs to change is the code that
>> sets headers and the scan method. I also like header lookups to be
>> efficient and the representation compact.
>>
>> > Server: Fool/1.0
>> > content-encoding: gzip
>> > Content-Type: text/plain; charset="UTF-8"
>> > Content-Encoding: base64
>> > Date: Fri Sep 5 10:24:37 CEST 2008
>> >
>> > Would be stored as (assuming push_header):
>>
>> My suggestion would be:
>>
>> bless {
>>     "content-encoding" => ["\n gzip", "base64"],
>>     "content-type" => "text/plain; charset=\"UTF-8\"",
>>     "date" => "Fri Sep 5 10:24:37 CEST 2008",
>>     "server" => "Fool/1.0",
>>     "::original_fields" => [
>>         "Server",
>>         "content-encoding",
>>         "Content-Type",
>>         "Content-Encoding",
>>         "Date",
>>     ],
>> }, "HTTP::Headers";
>>
>> The invariant that needs to hold is that there is the same number of
>> elements in {"::original_fields"} as there are values for all the
>> other keys.
>>
>> Pushing a value is trivial; the only change from what we have now is
>> appending the original field name to {"::original_fields"}.
>>
>> The only state modification operation that becomes more complex is
>> setting of a header value. It has to:
>>
>> - update the values in the hash as before
>> - locate the first occurrence of the field name in
>>   {"::original_fields"} => $idx
>> - remove all other occurrences of the field name
>> - splice(@{"::original_fields"}, $idx, 1, ($orig_field_name) x
>>   $numbers_of_values_set);
>>
>> When 'scan' wants to iterate over the original headers it would have
>> to keep an index into the values array for each field that repeats.
>>
>> A more compact representation could be to store {"::original_fields"}
>> as a ":"-separated string; but we can think about that optimization
>> later.
>>
>> --Gisle
>>
>
>
> --
> . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Mark Stosberg Principal Developer
> mark@summersault.com Summersault, LLC
> 765-939-9301 ext 202 database driven websites
> . . . . . http://www.summersault.com/ . . . . . . . .
>
>
>
I'm not comfortable with this patch. Some reasons:
- it's one big patch (actually there are 2)
- it does not follow the layout style in use, e.g. introduces cuddled elses
- it does various whitespace reformats
- introduces changes that are not optimizations (like replacing $op eq "GET" with $op == $OP_GET)
- introduces unused functions, e.g. _header_push_no_return
I do like to see performance improvements, so I would not mind smaller patches with demonstrated good effects. Having HTTP::Headers::Fast as a benchmark to beat is good. I don't consider all the benchmarks in [1] valid. I don't think the speed of pushing thousands of values onto a header is important. The speed of getting and setting single values is.
I'm also slightly annoyed by the HTTP::Headers::Fast author for copying my code and then claiming[2] he wrote it.
On Tue, Jan 26, 2010 at 17:38, Mark Stosberg <mark@summersault.com> wrote:
>
> I've now published some patches in "git" that port the performance
> improvements of HTTP::Headers::Fast back to HTTP::Headers:
>
> http://github.com/markstos/libwww-perl/commits/http-headers-fast
>
> The changes benchmark to be 10 to 20% faster on average and pass all of
> the HTTP::Headers regression tests.
>
> As I just mentioned in a previous post, it also adds a new method to
> generate the headers in an unsorted order, for better performance. The
> behavior of "as_string()" is not changed.
>
> Mark
On 1 Feb 2010, at 16:34, Stanisław T. Findeisen wrote:
> Dirk-Willem van Gulik wrote:
> > On 1 Feb 2010, at 14:54, Stanisław T. Findeisen wrote:
> >
> >> HTTPS_CA_FILE ...
> >
> > If I recall correctly (and this may be a few years out of date) - this only works if you are relying on Net::SSL as the underlying SSL library. It won't work with IO::Socket::SSL.
>
> Good. How to use it? I thought "use LWP::UserAgent;" does the job? : http://search.cpan.org/~dland/Crypt-SSLeay-0.57/SSLeay.pm
They bubble up through there; though the latter supports SSL_ca_file and SSL_ca_path.
I think that the structure under the covers is
    Crypt::SSLeay
        Net::SSL
or
    Net::SSLeay
        Net::SSLeay::Handle
        IO::Socket::SSL
        Net::Server::Proto::SSL
Dirk-Willem van Gulik wrote:
> On 1 Feb 2010, at 14:54, Stanisław T. Findeisen wrote:
>
>> HTTPS_CA_FILE ...
>
> If I recall correctly (and this may be a few years out of date) - this only works if you are relying on Net::SSL as the underlying SSL library. It won't work with IO::Socket::SSL.
Good. How to use it? I thought "use LWP::UserAgent;" does the job? : http://search.cpan.org/~dland/Crypt-SSLeay-0.57/SSLeay.pm
On 1 Feb 2010, at 14:54, Stanisław T. Findeisen wrote:
> HTTPS_CA_FILE ...
If I recall correctly (and this may be a few years out of date) - this only works if you are relying on Net::SSL as the underlying SSL library. It won't work with IO::Socket::SSL.
Thanks,
Dw.
LWP version: 5.813
SSL_connect:before/connect initialization
SSL_connect:SSLv3 write client hello A
SSL_connect:SSLv3 read server hello A
SSL_connect:SSLv3 read server certificate A
SSL_connect:SSLv3 read server done A
SSL_connect:SSLv3 write client key exchange A
SSL_connect:SSLv3 write change cipher spec A
SSL_connect:SSLv3 write finished A
SSL_connect:SSLv3 flush data
SSL_connect:SSLv3 read finished A
Status: 200 OK
issuer : /C=US/O=Equifax/OU=Equifax Secure Certificate Authority
subject : /C=US/O=sourceforge.net/OU=3754508056/OU=See www.geotrust.com/resources/cps (c)09/OU=Domain Control Validated - QuickSSL(R)/CN=sourceforge.net
cipher : RC4-MD5
If I, however, connect to a local site with self-signed certificate I get this:
SSL_connect:before/connect initialization
SSL_connect:SSLv3 write client hello A
SSL_connect:SSLv3 read server hello A
SSL3 alert write:fatal:bad certificate
SSL_connect:error in SSLv3 read server certificate B
SSL_connect:before/connect initialization
SSL_connect:SSLv2 write client hello A
SSL_connect:failed in SSLv2 read server hello A
Status: 500 SSL negotiation failed:
are ineffective? I am setting $ENV{HTTPS_CA_DIR} to '/var/log/' so that it is set to something valid but with no certificates. Setting this to undef or skipping this line doesn't help.
What's wrong? (Wells_Fargo_Root_CA.pem doesn't look like Equifax.) Am I using Crypt::SSLeay? How to know that?
I know 5.813 is not the newest version but this is the one in the current Debian GNU/Linux distro...
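One way to approach the "am I using Crypt::SSLeay?" question is to look at %INC after the first https request and see which SSL modules Perl has actually loaded. The sketch below only inspects %INC; the candidate list and the helper name `loaded_modules` are assumptions for illustration. (Net::SSL ships as part of the Crypt::SSLeay distribution, so seeing Net::SSL loaded points to Crypt::SSLeay.)

```perl
use strict;
use warnings;

# Candidate SSL modules. Which of these LWP loads depends on the
# installation, so this list is an assumption for illustration.
my @SSL_CANDIDATES = qw(Net::SSL Crypt::SSLeay IO::Socket::SSL Net::SSLeay);

# Return the subset of the given module names that are already loaded,
# judged by their entries in %INC.
sub loaded_modules {
    my @names = @_;
    return grep {
        (my $file = "$_.pm") =~ s{::}{/}g;
        exists $INC{$file};
    } @names;
}

# Call this after the first https request has been made.
print "Loaded SSL modules: @{[ loaded_modules(@SSL_CANDIDATES) ]}\n";
```

If Net::SSL shows up, the HTTPS_CA_FILE and HTTPS_CA_DIR environment variables are the relevant knobs; if only IO::Socket::SSL shows up, they would not be expected to have an effect, per the reply above.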
Best wishes,
Yun-an
>Yun-an,
>
>The test that failed was a live test, meaning it ran against a live website
>and could have failed for a reason caused by the server rather than
>your software. It is probably safe to ignore. You can just do a "make
>install" to skip it.
>
> Mark
On Thu, 28 Jan 2010 15:51:47 +0530 bipin Nayak <nbipin78@gmail.com> wrote:
> Thanks for adding me to this group.
>
> Following is the script and result I am getting:-
Bipin,
I tried the script as you gave it and it gave the source of the login page as the result, *not* a 301. Are you using the latest versions of LWP::UserAgent and WWW::Mechanize? I just downloaded the latest versions from CPAN.
On Wed, 27 Jan 2010 12:47:39 +0100 Yun-an Yan <yun-an.yan@uni-rostock.de> wrote:
> Dear All,
>
> I cannot pass the test when I try to install Frameready-1.020.
> Would somebody please help me?
Yun-an,
The test that failed was a live test, meaning it ran against a live website and could have failed for a reason caused by the server rather than your software. It is probably safe to ignore. You can just do a "make install" to skip it.
I cannot pass the test when I try to install Frameready-1.020. Would somebody please help me?
$ perl -v
This is perl, v5.8.9 built for darwin-2level
$ uname -a
Darwin *.*.*.* 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386
$ make test
/opt/local/bin/perl t/TEST 0
base/ua...........ok
html/form.........ok
local/autoload....ok
local/frames......ok
local/http........ok
local/protosub....ok
live/cpan.........Server closed connection without sending any data back at /opt/local/lib/perl5/site_perl/5.8.9/Net/HTTP/Methods.pm line 345.
live/cpan.........dubious
        Test returned status 255 (wstat 65280, 0xff00)
DIED. FAILED tests 1-2
        Failed 2/2 tests, 0.00% okay
Failed Test Stat Wstat Total Fail List of Failed
-------------------------------------------------------------------------------
live/cpan.t 255 65280 2 3 1-2
Failed 1/7 test scripts. 2/46 subtests failed.
Files=7, Tests=46, 6 wallclock secs ( 0.74 cusr + 0.15 csys = 0.89 CPU)
Failed 1/7 test programs. 2/46 subtests failed.
make: *** [test] Error 255
The changes benchmark to be 10 to 20% faster on average and pass all of the HTTP::Headers regression tests.
As I just mentioned in a previous post, it also adds a new method to generate the headers in an unsorted order, for better performance. The behavior of "as_string()" is not changed.
In 2008 there was some discussion about an option to preserve the ordering of HTTP headers. Part of that thread is quoted below.
The idea resurfaced in another form with the release of HTTP::Headers::Fast, which provided a method to get back the headers unsorted. However, the motivation was different there-- performance-- and the implementation was different as well. It returns headers in essentially random order instead of the order in which they were created or transmitted.
I took an interest in the issue of HTTP header ordering and researched what several other Perl modules do in regards to this as well as Ruby's Rack. I published the result on my blog:
http://mark.stosberg.com/blog/2010/01/generating-http-headers-sorted-or-unsorted.html
The summary is that I support the option for unsorted headers in HTTP::Headers. Michael Greb made a good case for it, and the possibility for a performance improvement is attractive too.
> On Sun, Sep 7, 2008 at 1:49 PM, Michael Greb <mgreb@linode.com> wrote:
> > On Sep 5, 2008, at 7:23 PM, Gisle Aas wrote:
> >>
> >> True; and in this case we need to define what happens when fields are
> >> modified with 'push', 'set' or 'init' and 'remove' as that's the API
> >> that modifies stuff. Let me suggest the following definition of the
> >> behaviour:
> >>
> >> - 'push' always appends the field at the end of all headers. Multiple
> >>   occurrences of a field name do not have to be consecutive.
> >>
> >> - 'init' either does nothing or it works like 'push'.
> >>
> >> - 'remove' will always remove all occurrences of a field.
> >>
> >> - 'set' will work like 'push' if no other occurrence of the field exists.
> >>
> >> - 'set' will update the first occurrence if the field exists (and
> >>   remove all other occurrences). If multiple field values are provided
> >>   with 'set' they are basically all injected at the location of the
> >>   first existing value.
> >
> >
> > On Sep 6, 2008 at 2:57 AM, Gisle Aas wrote:
> >>
> >> I think it makes sense to be able to enable them separately.
> >> Suggested interface:
> >>
> >> $h->scan(\&cb, original_order => 1, original_case => 1);
> >> $h->as_string(eol => "\n", original_order => 1, original_case => 1);
> >
> > The attached patch uses the interface above and works towards the behavior
> > outlined in the first message. Due to the headers being stored as a hash,
> > pushing does not currently preserve previous values; second and subsequent
> > pushes of the same header will overwrite the previous value. Supporting
> > this would require a change in how the headers are stored within the module.
> > Your thoughts?
>
> I think it's better to just use your original approach and just keep
> the representation as it used to be, with the addition of an array that
> records the original field names and their order. This should lead to
> a smaller patch as the only thing that needs to change is the code that
> sets headers and the scan method. I also like header lookups to be
> efficient and the representation compact.
>
> > Server: Fool/1.0
> > content-encoding: gzip
> > Content-Type: text/plain; charset="UTF-8"
> > Content-Encoding: base64
> > Date: Fri Sep 5 10:24:37 CEST 2008
> >
> > Would be stored as (assuming push_header):
>
> My suggestion would be:
>
> bless {
>     "content-encoding" => ["\n gzip", "base64"],
>     "content-type" => "text/plain; charset=\"UTF-8\"",
>     "date" => "Fri Sep 5 10:24:37 CEST 2008",
>     "server" => "Fool/1.0",
>     "::original_fields" => [
>         "Server",
>         "content-encoding",
>         "Content-Type",
>         "Content-Encoding",
>         "Date",
>     ],
> }, "HTTP::Headers";
>
> The invariant that needs to hold is that there is the same number of
> elements in {"::original_fields"} as there are values for all the
> other keys.
>
> Pushing a value is trivial; the only change from what we have now is
> appending the original field name to {"::original_fields"}.
>
> The only state modification operation that becomes more complex is
> setting of a header value. It has to:
>
> - update the values in the hash as before
> - locate the first occurrence of the field name in
>   {"::original_fields"} => $idx
> - remove all other occurrences of the field name
> - splice(@{"::original_fields"}, $idx, 1, ($orig_field_name) x
>   $numbers_of_values_set);
>
> When 'scan' wants to iterate over the original headers it would have
> to keep an index into the values array for each field that repeats.
>
> A more compact representation could be to store {"::original_fields"}
> as a ":"-separated string; but we can think about that optimization
> later.
>
> --Gisle
>
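The representation and the push/set/scan steps described in the quoted 2008 message can be sketched on a plain hash as follows. This is an illustration of the idea only, not HTTP::Headers code: the function names `push_field`, `set_field` and `scan_original` are made up, and `set_field` assumes at least one value is passed.

```perl
use strict;
use warnings;

# Append a value; the original spelling of the name is recorded in order.
sub push_field {
    my ($h, $name, $value) = @_;
    my $key = lc $name;
    if    (!exists $h->{$key})           { $h->{$key} = $value }
    elsif (ref($h->{$key}) eq 'ARRAY')   { push @{ $h->{$key} }, $value }
    else                                 { $h->{$key} = [ $h->{$key}, $value ] }
    push @{ $h->{'::original_fields'} }, $name;
}

# Replace all values of a field, keeping the position of its first occurrence.
sub set_field {
    my ($h, $name, @values) = @_;
    my $key  = lc $name;
    my $orig = $h->{'::original_fields'} ||= [];
    $h->{$key} = @values == 1 ? $values[0] : [@values];

    # Locate the first occurrence of the field name.
    my ($idx) = grep { lc($orig->[$_]) eq $key } 0 .. $#$orig;
    if (!defined $idx) {                  # no occurrence: behaves like push
        push @$orig, ($name) x @values;
        return;
    }
    # Drop every other occurrence; $idx stays valid because only later
    # entries are removed.
    my @keep = map  { $orig->[$_] }
               grep { $_ == $idx || lc($orig->[$_]) ne $key } 0 .. $#$orig;
    splice(@keep, $idx, 1, ($name) x @values);
    @$orig = @keep;
}

# Call $cb->($name, $value) for each header in original order and case,
# keeping a per-field index into the values array for repeated fields.
sub scan_original {
    my ($h, $cb) = @_;
    my %next;
    for my $name (@{ $h->{'::original_fields'} || [] }) {
        my $key   = lc $name;
        my $v     = $h->{$key};
        my $value = ref($v) eq 'ARRAY' ? $v->[ $next{$key}++ ] : $v;
        $cb->($name, $value);
    }
}
```

The invariant from the message holds after each call: the number of entries in "::original_fields" equals the total number of stored values.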
On Jan 3, 2010, at 22:22 , Father Chrysostomos wrote:
> I came across a bug in HTTP::Message::content_charset. It has a ‘local $_’, which is unnecessary, since foreach loops already localise their topic. In fact, they do it in a safer way that is more like local *_=\do{my$x}. Using local $_ causes problems if $_ is tied. The attached patch removes the offending line and adds a test for it. This did actually occur in real code, and is not just a theoretical problem.
>
> <open_vsKSmGR8.txt>
Thanks. Applied as <http://github.com/gisle/libwww-perl/commit/0d33cd81894ed02f7005ca742206e90511338880>.
I came across a bug in HTTP::Message::content_charset. It has a ‘local $_’, which is unnecessary, since foreach loops already localise their topic. In fact, they do it in a safer way that is more like local *_= \do{my$x}. Using local $_ causes problems if $_ is tied. The attached patch removes the offending line and adds a test for it. This did actually occur in real code, and is not just a theoretical problem.
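The point that foreach already localises its topic can be seen in a small standalone sketch. The function `first_charset` is a made-up stand-in, not the actual content_charset code; it loops with the implicit `$_` and has no `local $_`.

```perl
use strict;
use warnings;

# Made-up stand-in for a loop over charset candidates.
# No "local $_" is needed: foreach aliases $_ to each element and
# restores the caller's $_ when the loop is left, even via return.
sub first_charset {
    my @candidates = @_;
    for (@candidates) {
        return $_ if defined($_) && /^[\w.:-]+$/;
    }
    return;
}

$_ = "caller's value";
my $found = first_charset(undef, "", "utf-8");
print "found: $found\n";
print "\$_ is still: $_\n";
```

After the call the caller's `$_` is unchanged, which is why the extra `local $_` line could simply be removed.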
On Wed, Nov 11, 2009 at 06:34:24PM -0800, Ilya Zakharevich wrote:
> A good example of the report of failure is
> http://www.nntp.perl.org/group/perl.cpan.testers/2009/11/msg5926497.html
> Essentially, the error message is
>
> Getting GP/PARI from ftp://megrez.math.u-bordeaux.fr/pub/pari/unix/
> Cannot list (Illegal PORT command.
> ): at utils/Math/PariBuild.pm line 319.
>
> Can't fetch file with Net::FTP, now trying with LWP::UserAgent...
> Not in this directory, trying `ftp://megrez.math.u-bordeaux.fr/pub/pari/unix/OLD/'...
> then it shows that the response via LWP for ..../OLD is empty (of type
> text/ftp-dir-listing), and shows that `ftp -pinegv' has no problem
I got the first non-Unix report with a similar failure. See http://www.nntp.perl.org/group/perl.cpan.testers/2009/11/msg6061071.html
It goes like this:
=======================================================
Getting GP/PARI from ftp://megrez.math.u-bordeaux.fr/pub/pari/unix/
Not in this directory, now chdir('OLD')...
Can't use an undefined value as a symbol reference at C:/home/stro/perl5111/lib/Net/FTP/dataconn.pm line 54.
Can't fetch file with Net::FTP, now trying with LWP::UserAgent...
Not in this directory, trying `ftp://megrez.math.u-bordeaux.fr/pub/pari/unix/OLD/'...
Can't fetch directory listing from ftp://megrez.math.u-bordeaux.fr/pub/pari/unix/OLD/: 500 Can't use an undefined value as a symbol reference
Content-Type: text/plain
Client-Date: Thu, 19 Nov 2009 17:32:57 GMT
Client-Warning: Internal response
500 Can't use an undefined value as a symbol reference
Hi all. I have just sorted out the problem. In the end, I need to activate 3 web pages in succession:
first, login: "theWebServer/main"
second, toShowTimeSlot: "theWebServer/ShowTimeslot?parameters=selected"
third, toMakeBooking: "theWebServer/bookingRecord?parameters=selected&etc"
The first step is a must. What puzzled me is that I have to activate the third step by a simple access to the second webpage, although I have every parameter needed for the third web request. I think maybe it writes something to the cookie during the access.
---------------------------- Original Message ----------------------------
Subject: Re: UserAgent->get problem
From: "Keary Suska" <hierophant@pcisys.net>
Date: Sun, November 15, 2009 11:14 pm
To: "Libwww Perl" <libwww@perl.org>
--------------------------------------------------------------------------
On Nov 15, 2009, at 1:29 AM, SHANG Yuan wrote:
> get the window.open "url" from the second webpage(mentioned above),paste
> it to the web-browser, and get the same failure result as I get by perl
> script. But after I have clicked the "pic.gif" in the second webpage from
> the web-browser, I re-enter the "url" of the third webpage and this time
> the browser returned just what I wanted.
> It seems that the "click" action have activated something I haven't noticed.
>
> Have anyone here come with the similar problems before?
> Any advice will be appreciated.
Thanks, Gisle. My access to the web server can be divided into 3 steps:
1. Login. This succeeds. I get the same content using UserAgent as when I log in through the web browser.
2. ToABookingSystem. This succeeds. This second webpage contains some JavaScript, and in the web browser it goes to the third webpage after one link is clicked. The following is the corresponding web-page code:
3. Submit my request. This third webpage comes from the second webpage. I get the window.open "url" from the second webpage (mentioned above), paste it into the web browser, and get the same failure result as I get with the Perl script. But after I have clicked the "pic.gif" in the second webpage in the web browser, I re-enter the "url" of the third webpage and this time the browser returns just what I wanted. It seems that the "click" action has activated something I haven't noticed.
Has anyone here come across similar problems before? Any advice will be appreciated.
Yuan SHANG
> On Sat, Nov 14, 2009 at 16:58, SHANG Yuan <spl@ust.hk> wrote:
>> I want to submit my request to a website, and I successfully parse the
>> "get" parameters.
>> When I paste the
>> url('http://mywebsite/search?aa=1&bb=2&JSEnable=true')
>> to firefox, it returns a successful value. However,
>> $response=$ua->get($url);
>> did not return the successful value.
>> Did any kind spirit help me figure it out?
>
> You need to provide more information about the actual server you try
> to connect to in order to get any real advice. My guess is that the
> server cares about the User-Agent or Cookie headers. Sniffing the
> traffic that the browser generates when talking to the server might be
> instructive.
>
> --Gisle
>
>
>>
>> By the way,the website need username and password.I have use the
>> statements below to deal with it. And It seems work well, since I can
>> get the content of the webpages without problems.
>>
>> my $response = $ua->post( $url,
>>     [ loginID=>$username,
>>       passwd => $passwd,
>>     ]
>> );
>>
>>
>>
>
On Sat, Nov 14, 2009 at 16:58, SHANG Yuan <spl@ust.hk> wrote:
> I want to submit my request to a website, and I successfully parse the
> "get" parameters.
> When I paste the url('http://mywebsite/search?aa=1&bb=2&JSEnable=true')
> to firefox, it returns a successful value. However,
> $response=$ua->get($url);
> did not return the successful value.
> Did any kind spirit help me figure it out?
You need to provide more information about the actual server you try to connect to in order to get any real advice. My guess is that the server cares about the User-Agent or Cookie headers. Sniffing the traffic that the browser generates when talking to the server might be instructive.
--Gisle
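To make the User-Agent and Cookie suggestion concrete, here is a minimal sketch of a session that shares one cookie jar across the login and the following requests. The helper name `fetch_with_session`, its argument names and the example agent string are assumptions; only standard LWP::UserAgent calls (`new` with `agent` and `cookie_jar`, `post`, `get`) are used, and the module is loaded lazily inside the function.

```perl
use strict;
use warnings;

# Hypothetical helper: log in, then request the given URLs in order,
# the way a browser would walk through the pages. All requests share
# one User-Agent string and one in-memory cookie jar.
sub fetch_with_session {
    my (%arg) = @_;
    require LWP::UserAgent;              # not a core module
    my $ua = LWP::UserAgent->new(
        agent      => $arg{agent} || 'Mozilla/5.0 (compatible; example)',
        cookie_jar => {},                # enables cookie handling
    );

    my $login = $ua->post($arg{login_url}, $arg{login_form});
    die "login failed: " . $login->status_line . "\n"
        unless $login->is_success || $login->is_redirect;

    my $last;
    for my $url (@{ $arg{urls} }) {
        $last = $ua->get($url);
        die "GET $url failed: " . $last->status_line . "\n"
            unless $last->is_success;
    }
    return $last;                        # response for the final URL
}
```

A call would pass `login_url`, `login_form` (an array ref of form fields such as loginID and passwd) and `urls` (the intermediate page followed by the final one), so that any cookie set by the intermediate page is sent with the last request.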
>
> By the way,the website need username and password.I have use the
> statements below to deal with it. And It seems work well, since I can
> get the content of the webpages without problems.
>
> my $response = $ua->post( $url,
>     [ loginID=>$username,
>       passwd => $passwd,
>     ]
> );
>
>
>
Hi all. I want to submit my request to a website, and I successfully parsed the "get" parameters. When I paste the url ('http://mywebsite/search?aa=1&bb=2&JSEnable=true') into Firefox, it returns a successful value. However, $response=$ua->get($url); did not return the successful value. Could any kind soul help me figure it out?
By the way, the website needs a username and password. I have used the statements below to deal with it, and it seems to work well, since I can get the content of the webpages without problems.
    my $response = $ua->post( $url,
        [ loginID => $username,
          passwd => $passwd,
        ]
    );
No idea what the problem is offhand. I'll try to find some time to debug this during the weekend.
--Gisle
On Thu, Nov 12, 2009 at 03:34, Ilya Zakharevich <nospam-abuse@ilyaz.org> wrote:
> There are many "yellow" reports for Math::Pari in the smoke test
> database. They come from failures to download C source code for the
> library Math::Pari need. Net::FTP and LWP fail (kinda mysteriously);
> usual `ftp -pinegv' I run for debugging purposes succeeds.
>
> The recent versions of Math::Pari have some code to debug these
> failures; however, the debugging output says nothing to me (IIRC, most
> of the code to auto-fetch was contributed). Any help is appreciated.
>
> Thanks,
> Ilya
>
> =======================================================
> A good example of the report of failure is
> http://www.nntp.perl.org/group/perl.cpan.testers/2009/11/msg5926497.html
> Essentially, the error message is
>
> Getting GP/PARI from ftp://megrez.math.u-bordeaux.fr/pub/pari/unix/
> Cannot list (Illegal PORT command.
> ): at utils/Math/PariBuild.pm line 319.
>
> Can't fetch file with Net::FTP, now trying with LWP::UserAgent...
> Not in this directory, trying `ftp://megrez.math.u-bordeaux.fr/pub/pari/unix/OLD/'...
>
> then it shows that the response via LWP for ..../OLD is empty (of type
> text/ftp-dir-listing), and shows that `ftp -pinegv' has no problem
> getting the listings (for both directories) and the file.
>
> The code which emits these messages is in
> http://cpansearch.perl.org/src/ILYAZ/Math-Pari-2.0304_00108060102/utils/Math/PariBuild.pm
>
> starting from
> require Net::FTP;
>
>
There are many "yellow" reports for Math::Pari in the smoke test database. They come from failures to download the C source code for the library Math::Pari needs. Net::FTP and LWP fail (kinda mysteriously); the usual `ftp -pinegv' I run for debugging purposes succeeds.
The recent versions of Math::Pari have some code to debug these failures; however, the debugging output says nothing to me (IIRC, most of the code to auto-fetch was contributed). Any help is appreciated.
Thanks, Ilya
=======================================================
A good example of the report of failure is
http://www.nntp.perl.org/group/perl.cpan.testers/2009/11/msg5926497.html
Essentially, the error message is
Getting GP/PARI from ftp://megrez.math.u-bordeaux.fr/pub/pari/unix/
Cannot list (Illegal PORT command.
): at utils/Math/PariBuild.pm line 319.
Can't fetch file with Net::FTP, now trying with LWP::UserAgent...
Not in this directory, trying `ftp://megrez.math.u-bordeaux.fr/pub/pari/unix/OLD/'...
then it shows that the response via LWP for ..../OLD is empty (of type text/ftp-dir-listing), and shows that `ftp -pinegv' has no problem getting the listings (for both directories) and the file.
The code which emits these messages is in http://cpansearch.perl.org/src/ILYAZ/Math-Pari-2.0304_00108060102/utils/Math/PariBuild.pm
I am using a UserAgent with a callback to handle a streaming connection. How does one terminate the connection client-side? How do you close the connection other than by exiting the program?
Bill Moseley wrote:
> Now, if you for some reason really want to encode to latin1,
Actually, I need to encode to the charset (encoding) that is used in the document fetched by LWP::UserAgent. I don't mind if that is UTF-8. :-) But often it is latin1.
I'm trying to learn from this, too. Please (anyone) correct me if I'm wrong below.
On Thu, Oct 15, 2009 at 11:18 AM, Oliver Block <lists@oliver-block.eu> wrote:
> I think I've found out what causes the problem. As I mentioned earlier
> the content of a td tag in my case "» Kontakt › Kontaktformular" will be
> represented by the following ... characters (?)
> "\x{bb} Kontakt \x{a0}\x{203a} Kontaktformular" and the reason seems
> to be that there is nothing like a character representation in the
> ISO-8859-1 encoding. The codepoint (for ›) is U+203A or &rsaquo;
> This seems to be a legal character in ISO-8859-1-encoded html documents
> when it appears in the form of a character entity reference.
Well, I think you are slightly mixing things there. But, it's probably more about terminology.
The 8 letters and symbols that make up "&rsaquo;" are all valid ISO-8859-1 code points. The character that the entity represents is not an ISO-8859-1 character. One point of the entity is to allow the browser to render characters that are not in the encoding used to transmit the document from the server to the browser.
What I think is happening in your case is that when the *entities* are parsed they end up as wide characters, so Perl has to promote the text to wide characters -- that is, it sets the utf8 flag on the data so that Perl can represent the *character*.
Now, this won't happen if you don't have entities (well, entities that represent wide characters). If, for example, you have just utf8 characters in the web page you are parsing and don't decode it (which I consider a programming error), then you won't end up with the utf8 flag on. That is, you have octets instead of characters inside Perl.
And w/o the utf8 flag set you won't get "wide character in print" errors, either, so you don't even know you are doing it wrong. ;)
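To make the octets-versus-characters point concrete, here is a minimal sketch using only the core Encode module. The sample string is made up for illustration (it is not the page from this thread), and the variable names are mine:

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 octets for a right angle quote followed by " Kontakt", as they
# might arrive from the network (hypothetical sample data).
my $octets = "\xE2\x80\xBA Kontakt";

# Undecoded: Perl sees one "character" per octet and the utf8 flag is off,
# so printing it never triggers a "Wide character" warning.
my $raw_length = length $octets;
my $raw_flag   = utf8::is_utf8($octets) ? 1 : 0;

# Decoded: Perl sees the actual characters, including U+203A.
my $characters  = decode('UTF-8', $octets);
my $char_length = length $characters;
my $first_cp    = sprintf 'U+%04X', ord $characters;

print "undecoded: length=$raw_length utf8_flag=$raw_flag\n";
print "decoded:   length=$char_length first=$first_cp\n";
```

The two lengths differ because the undecoded string is counted in octets and the decoded one in characters.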
Bill Moseley wrote:
> So, in general, I would bring character data into Perl like:
>
>     my $characters = $response->decoded_content;
>
> Then you work with $characters as needed.
>
> And then when you want to output you convert back to whatever encoding
> you need:
>
>     $utf8_octets = encode_utf8( $characters );
>
>     send_to_client( $utf8_octets );
>
> For your case you might try $tree->parse( $response->decoded_content
> ); Or, if you have raw utf-8 octets that you need to parse I think
> you can call $tree->utf8_mode( 1 ) to tell the parser to decode. But,
> I'd prefer the first.
That seems to be a good idea. There are only some modifications I have to make, because the encoding of incoming documents is not always the same. It can be latin1 or utf-8 or others. Those who create the web pages are not always that precise. That's why HTML::Parser is such a good choice in these cases, because it is tolerant.
I thought that not touching the encoding would be the best idea, but decoding characters with code points higher than 255 seems to be better. But it might also be a good idea to use $response->decoded_content and later encode the content again. At least if $response always provides a ->content_charset.
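One way the "encode the content again" step could look is sketched below, using only the core Encode module. The helper name encode_for_charset and the ISO-8859-1 fallback are my assumptions. In real code the charset would come from $response->content_charset. Characters the target charset cannot represent are written as numeric character references via Encode's FB_HTMLCREF fallback:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode character data to the page's own charset,
# replacing anything the charset cannot represent with &#NNNN; references.
sub encode_for_charset {
    my ($characters, $charset) = @_;
    $charset = 'ISO-8859-1' unless defined $charset;   # assumed fallback
    return encode($charset, $characters,
                  Encode::FB_HTMLCREF() | Encode::LEAVE_SRC());
}

# The text node from the thread, as characters.
my $text = "\x{bb} Kontakt \x{a0}\x{203a} Kontaktformular";

my $latin1 = encode_for_charset($text, 'ISO-8859-1');
my $utf8   = encode_for_charset($text, 'UTF-8');

print "latin1 octets contain an entity for U+203A\n" if $latin1 =~ /&#8250;/;
print "utf8 octets contain U+203A directly\n"        if $utf8 =~ /\xE2\x80\xBA/;
```

This only makes sense for HTML text content, where a numeric reference is an acceptable substitute for the character itself.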
Oliver Block wrote:
> (You will find the perl code at the end)
>
> A close look to the dump of $tree and a comparison with
> $response->content showed the following:
>
> The following markup from $response->content
>
> <td colspan="8" align="left" bgcolor="#FFFFFF" class="Rubrik">»
> Kontakt › Kontaktformular</td>
>
> appears in tree as
>
> bless( {
>     '_parent' =>
>         $VAR1->{'_content'}[1]{'_content'}[0]{'_content'}[1]{'_content'}[5],
>     '_content' => [
>         "\x{bb} Kontakt \x{a0}\x{203a} Kontaktformular"
>     ],
>     'colspan' => '8',
>     'align' => 'left',
>     'bgcolor' => '#FFFFFF',
>     '_tag' => 'td',
>     'class' => 'Rubrik'
> }, 'HTML::Element' )
>
> If you have any idea how to avoid the conversion to utf8 and how to
> assure the the output of $tree->as_HTML() can be saved in the same
> encoding as stated in $response, please tell it.
I think I've found out what causes the problem. As I mentioned earlier the content of a td tag in my case "» Kontakt › Kontaktformular" will be represented by the following ... characters (?) "\x{bb} Kontakt \x{a0}\x{203a} Kontaktformular" and the reason seems to be that there is nothing like a character representation in the ISO-8859-1 encoding. The codepoint (for ›) is U+203A or &rsaquo; This seems to be a legal character in ISO-8859-1-encoded html documents when it appears in the form of a character entity reference.
So, changing the parameter for as_HTML from
$tree->as_HTML('<>&');
to
$tree->as_HTML();
solves the problem because now all "unsafe" characters (e.g. "\x{203a}") are encoded as entities within as_HTML(). Therefore there is no need for perl to encode the complete string to UTF-8 when using join() (see code at the end). That's at least what perluniintro mentions:
"Internally, Perl currently uses either whatever the native eight-bit character set of the platform (for example Latin-1) is, defaulting to UTF-8, to encode Unicode strings. Specifically, if all code points in the string are 0xFF or less, Perl uses the native eight-bit character set. Otherwise, it uses UTF-8." (perldoc perluniintro)
That's at least how I make sense of it.
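A small experiment along the lines of the perluniintro passage, as a sketch with made-up list contents standing in for @html. It suggests that join does not change any character; it only changes how Perl stores the string once one piece has a code point above 0xFF:

```perl
use strict;
use warnings;

# Stand-in for the @html list inside as_HTML (illustrative values only).
my @html = ("\xbb Kontakt ", "\x{203a}", " Kontaktformular");

# Before joining: only the piece containing U+203A carries the utf8 flag.
my @flags_before = map { utf8::is_utf8($_) ? 1 : 0 } @html;

my $joined = join '', @html;

# After joining: the same characters, now stored internally as UTF-8.
my $flag_after = utf8::is_utf8($joined) ? 1 : 0;
my $first_cp   = ord substr($joined, 0, 1);
my $length     = length $joined;

printf "flags before: %s\n", join(',', @flags_before);
printf "flag after: %d, first code point: 0x%X, length: %d\n",
    $flag_after, $first_cp, $length;
```

Writing $joined to a file without an explicit encoding step is what exposes the internal UTF-8 bytes.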
Best regards,
Oliver Block
On Wed, Oct 14, 2009 at 6:00 PM, Oliver Block <lists@oliver-block.eu> wrote:
> my $ua = LWP::UserAgent->new;
> my $response = $ua->get($form->{'url'});
>
> my $tree = HTML::TreeBuilder->new();
> $tree->parse($response->content);
>
> # ...
> # encoding of content of $tree is ISO-8859-1 at this point
> $template = $tree->as_HTML('<>&');
>
> # encoding of content of $template is UTF-8
>
> Now the following problem arises. The encoding of the content of
> $template (UTF-8) is not the same than the content of $tree
> (ISO-8859-1). So it is obvious, that as_HTML converts the encoding to
> UTF-8.
I'm not really sure what the problem is, sorry. But the terminology above seems a bit off.
UTF-8 and ISO-8859-1 are encodings (encoded octets), not characters. Characters are an abstraction. You should use characters inside Perl and encoded octets outside. (Ignore the fact that Perl's internal encoding is UTF-8 and just pretend they are character abstractions.)
So, in general, I would bring character data into Perl like:
my $characters = $response->decoded_content;
Then you work with $characters as needed.
And then when you want to output you convert back to whatever encoding you need:
$utf8_octets = encode_utf8( $characters );
send_to_client( $utf8_octets );
For your case you might try $tree->parse( $response->decoded_content ); Or, if you have raw utf-8 octets that you need to parse I think you can call $tree->utf8_mode( 1 ) to tell the parser to decode. But, I'd prefer the first.
(One thing I'm not clear on is when or if the parsers detect encoding by looking for a charset in the content. XML::LibXML will use the <?xml encoding= from the content, for example. But I'm not clear if the HTML::Parser will look at an encoding set in a <meta> tag.)
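The decode-on-input, encode-on-output pattern described above can be tried without a network connection. This is only a sketch: $latin1_octets stands in for $response->content, and the explicit decode call plays the role that decoded_content would play with LWP.

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# Stand-in for $response->content: ISO-8859-1 octets (hypothetical sample).
my $latin1_octets = "\xBB Kontakt";

# Inbound: octets -> characters.
my $characters = decode('ISO-8859-1', $latin1_octets);

# Work on characters, not octets. Appending a wide character is harmless here.
$characters .= " \x{203a} Kontaktformular";

# Outbound: characters -> octets in the encoding the consumer expects.
my $utf8_octets = encode_utf8($characters);

printf "%d characters, %d UTF-8 octets\n",
    length $characters, length $utf8_octets;
```

The octet count is larger than the character count because the two non-ASCII characters need more than one octet each in UTF-8.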
I still do not understand why that happens, but join certainly does not cause it.
If you have any idea how to avoid the conversion to utf8 and how to ensure that the output of $tree->as_HTML() can be saved in the same encoding as stated in $response, please tell me.
Best Regards,
Oliver Block
The following code is used to load a web page from a certain web server and parse it into an html tree. At the end a variable is assigned the string representation of that tree.
use LWP::UserAgent;
use HTML::TreeBuilder;
my $ua = LWP::UserAgent->new;
my $response = $ua->get($form->{'url'});
my $tree = HTML::TreeBuilder->new();
$tree->parse($response->content);
# ...
# encoding of content of $tree is ISO-8859-1 at this point
$template = $tree->as_HTML('<>&');
# encoding of content of $template is UTF-8
Now the following problem arises. The encoding of the content of $template (UTF-8) is not the same as that of the content of $tree (ISO-8859-1). So it is obvious that as_HTML converts the encoding to UTF-8.
Is this behavior of as_HTML known? Will this be changed?
The following code is used to load a web page from a certain web server and parse it into an html tree. At the end a variable is assigned the string representation of that tree.
use LWP::UserAgent;
use HTML::TreeBuilder;
my $ua = LWP::UserAgent->new;
my $response = $ua->get($form->{'url'});
my $tree = HTML::TreeBuilder->new();
$tree->parse($response->content);
# ...
# encoding of content of $tree is ISO-8859-1 at this point
$template = $tree->as_HTML('<>&');
# encoding of content of $template is UTF-8
Now the following problem arises. The encoding of the content of $template (UTF-8) is not the same as that of the content of $tree (ISO-8859-1). So it is obvious that as_HTML converts the encoding to UTF-8.
I debugged everything and everything is fine up to the last line of code of sub HTML::Element::as_HTML, which is:
return join('', @html, "\n");
This would mean that join seems to modify the encoding of the content.
Can anyone confirm this problem before the webpage starts to respond again? On Windows it fails after 22 seconds. On Linux it fails after 189 seconds! Is this a Linux problem or a Perl-on-Linux problem? Can something be done to avoid this long timeout?
500 Connect failed: connect: Connection timed out; Connection timed out
time used: 189 seconds
require LWP::UserAgent;
require HTTP::Request;
$request = HTTP::Request->new(GET => 'http://www.utm.edu/departments/finearts/calen.htm');
$ua = LWP::UserAgent->new();
$response = $ua->request($request);
$content_charset = $response->content_charset();
^D
Can't use an undefined value as an ARRAY reference at C:/Perl/site/lib/HTTP/Message.pm line 251.
Summary of my perl5 (revision 5 version 8 subversion 9)
libwww version: 5.831
/Jesper Persson
Sorry for the noise! I was not using the latest version of libwww. After upgrading, decoded_content works perfectly on the page. Thanks Gisle.
Best Regards
Jesper Persson
> Date: Tue, 15 Sep 2009 17:08:49 +0200
> Subject: Re: decoded_content and page with content-type application/xhtml+xml; charset=utf-8
> From: gisle@aas.no
> To: jjperss@hotmail.com
> CC: libwww@perl.org
>
> On Tue, Sep 15, 2009 at 10:06, Jesper Jørgen Persson wrote:
>> message->decoded_content doesn't decode this page:
>> http://www.silkeborgbibliotekerne.dk/om+bibliotekerne/kontakt/sp%c3%b8rg+bibliotekaren
>>
>> because the content-type is: application/xhtml+xml
>> and decoded_content expects the content-type to start with "text"
>> from message.pm: if ($ct && $ct =~ m,^text/,,) {
>
> What version of LWP is this? The code does not look like that any more.
>
>> according to this page: http://www.w3.org/TR/xhtml-media-types/#application-xhtml-xml
>> the content-type should be correct.
>> What should I do ? Manually do something like: $content_ref = \Encode::decode($charset, $$content_ref);
>
> My guess is that upgrading LWP will help.
>
> --Gisle
Compiling My Own perl—brian d foy (first page)
FMTIEWTK About Closures—Johan Lodin (first page)
Expecting Perl—Mark Schoonover (first page)
Perl and Undecidability—Jeffrey Kegler (first page)
The Year in Perl, 2007—brian d foy (first page)
Templating My Output—Alberto Manuel Simões (first page)
Making My Own CPAN—brian d foy (first page)
Programming Parrot—Jonathan Scott Duff (first page)
Komodo Test Drive—Jim Brandt (first page)
Named Captures in Perl 5.9.5—brian d foy (first page)
Simple Web Access—Alberto Manuel Simões (first page)
Parrot Status Report—Jonathan Scott Duff (first page)
Mapping Op Codes—Eric Maki (first page)
CPANdeps—David Cantrell (first page)
HTML Slides—Grant McLean (first page)
Alter Egos—Anno Siegel (first page)
Found Perl is a section of The Perl Review's website where we posted pictures of Perl paraphernalia or the word "Perl" in the wild. We started it a long time ago, and were very slow in updating it when people sent in photos.
Now "Found Perl" is a photo pool on Flickr. Instead of sending them to TPR, just add them to the pool.
The Perl Review's website has had a section on Schwern's Shirt, the orange monstrosity that brian d foy bought at the charity auction for The Perl Foundation at the 2004 Open Source Convention.
Now we've moved the section of the website to a Flickr group for Schwern's shirt. This way, anyone can add their photos of Schwern's shirt to the group. Instead of being infrequently updated on the website, people can add them as soon as they upload them to Flickr.
If you don't have a Flickr account and don't want to create one, you can still send them to The Perl Review by mailing them to editors@theperlreview.com.
David Pogue had a momentary lapse of judgement when he proclaimed in his blog that the date sequence 01:02:03 04/05/06 will only happen once in all of human history.
Besides the obvious gaffes of date formatting (which one is the month and which one is the year?), the red herring of leading zeros (to make the minute and second stand out), and so on, no one who's seen this has made the comment that calendars say whatever we want them to say and the numbers are only special because we set the calendar up that way in this one case. What about the Chinese, Hebrew, and Muslim calendars?
So this seems like a good challenge to publish in The Perl Review: using the Perl Date modules (or not, I guess), in how many different calendars and formats can you make this sequence? What else is special about those days (are they a weekend, fall on a full moon, have a solar eclipse, etc.)?
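As a starting point for the challenge, here is a small sketch using only the core Time::Local module. It covers just three Gregorian readings of "04/05/06" and assumes the two-digit year means 20xx. The other calendars mentioned above would need CPAN date modules, which this sketch does not attempt.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my @day_names = qw(Sunday Monday Tuesday Wednesday Thursday Friday Saturday);

# Three common readings of "04/05/06" (illustrative, not exhaustive).
my %readings = (
    'MM/DD/YY' => { year => 2006, month => 4, day => 5 },
    'DD/MM/YY' => { year => 2006, month => 5, day => 4 },
    'YY/MM/DD' => { year => 2004, month => 5, day => 6 },
);

my %weekday;
for my $format (sort keys %readings) {
    my $d = $readings{$format};

    # timegm takes a zero-based month; 01:02:03 is the time from the post.
    my $epoch = timegm(3, 2, 1, $d->{day}, $d->{month} - 1, $d->{year});
    my @t     = gmtime $epoch;

    $weekday{$format} = $day_names[ $t[6] ];
    printf "%-8s -> %04d-%02d-%02d %02d:%02d:%02d (%s)\n",
        $format, $t[5] + 1900, $t[4] + 1, $t[3], $t[2], $t[1], $t[0],
        $weekday{$format};
}
```

Even within one calendar, the same digits land on three different days, which is the point of the objection about date formatting.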
Powell's Technical Books in Portland (that's the one on 33 NW Park Avenue) is going to carry The Perl Review. It's our first newsstand distribution.
I had to set a newsstand price. The deal basically works like this: bookstores keep most of the profit. Magazines make money when the single-issue buyers turn into subscribers. After Powell's cut, which we set at 40%, and my costs, $2 an issue, I have to figure out a price that also motivates people to give the money to The Perl Review directly instead of the book store. That's why you see big discounts for magazines when you subscribe: that's the real price, and everything else is markup. The Powell's price ends up being $5, which is 50 cents more than the subscription price.
That's not to say that newsstands are bad. It's like better-than-free advertising since it sits on the shelf and I cover my costs plus a little more for every issue sold.
Forget about absolute numbers for a moment. At my price point, if they sell 75% of the copies, I break even. That would be fine with me because any copy sitting on the actual magazine display means people see that issue. Some might subscribe later even if they don't buy it. Now that I have a price point, I have to figure out the right number of issues. That's something I just have to guess.
I left 16 copies of the Spring 2005 issue, but I also have to consider that I sold about 10 at the Intermediate Perl book signing. We'll see how that goes.
Now, a good magazine accountant has to keep track of the actual number of newsstand sales too. As much as I'd like to pretend that we sell every single copy, the Post Office wants to know where all the issues went to verify that we abide by all the periodical rules. It's not enough for the newsstand to simply tell me what they sold. They certainly aren't going to tell me they sold everything when they didn't since that's money out of their pocket. They can't really tell me they sold nothing because that's money out of my pocket.
If you're a late night person living in a city, you might have seen a bunch of guys tearing off the front pages of newspapers and magazines. Instead of sending back the unused copies, they send back the cover (and they do that for books too). Every cover they don't send back is a sale that they owe me money for.
You might think those unsold issues represent lost money, but they really don't. They are a sunk cost, meaning that I would have spent that money regardless of the sales. That starts way back at the printers when I have to decide how many issues I want. That number includes all subscriptions, complimentary copies, samples to user groups, and all the issues I'll need to fulfill orders for back issues. Not only that, but the more copies I print, the lower the incremental cost (the cost of each additional copy). Each printing job has a fixed overhead for the job preparation, machine set-up, and so on. That's the make ready. I end up printing many more copies than I need, partly to amortize the make ready. Not selling at the newsstand is slightly better than not selling while sitting in boxes in the office. At least people see them at the newsstand.
Remember that magazines make money on subscriptions, so that's the goal. I don't care about selling more at the newsstands. If someone subscribes because they see an issue on the newsstand, the profit from the subscription pays for about three unsold newsstand copies, so five subscriptions from people seeing the issue at Powell's would make up for no sales. That's just breaking even, and nobody makes any money. That also means I'm spending $6 to get a new subscriber.
If you're already despondent, you don't want to read about distributors. Most bookstores don't want to deal with every individual publisher. They'd have to keep track of a separate deal for every magazine. Instead, they want to deal with a single source in the same way they deal with books. I know my costs, and I know the newsstand's cut, and I have a price point that I can't change too much because people won't buy it at too high a price. If I use a distributor, perhaps to get into the big chain book stores, they are going to want a big cut too. I'll end up either breaking even or losing money on every newsstand copy, and I'll want to convert that to a subscription as soon as possible. That's why you see so many wonderful subscription cards in the magazines.
So far I've just talked about money from sales. We can also sell advertising, which we do for the special friends of the Perl community. Since magazines know they are going to lose money at the newsstand, they make up the difference with paid advertising. Ever wonder why magazines such as Wired are mostly advertisements? That's making up for the money they'll lose on the newsstands. Remember when I talked about keeping track of the number of copies sold? Advertisers want to know those numbers. They don't care how many copies the newsstand bought. They care about the number of copies that shoppers bought. That sets the rate at which the magazines can sell ads. More eyeballs equal more dollars. There's a separate industry of companies that audit magazines to verify the numbers. That's even more money that gets sucked away.
The short story? Subscribe to the magazines you like. It's the only way they can survive.
Last year, I started including interviews with Perl people on The Perl Review website, but I didn't add a feed for that. Now I have, and it's on the RSS page.
The web site is actually a directory processed by Template Toolkit, so what I really need to do is add indexing support as ttree goes through the directories, then spit out pages as it does that. That sounds like a magazine article...
Every issue I get a couple of book reviews that don't quite cut it. Techies tend to present too many sides of the story: rather than express their opinion, they equivocate by pointing out all the holes in their own opinion. In short, they are entirely too nice.
Being nice isn't a bad thing, but what people really want out of a book review is a recommendation. "Should I buy this book?" It doesn't matter if people agree with you as long as you are fair to the book. After that, people want to know the particulars: who, what, when, where, and how.
So far, The Perl Review has taken reviews from several different people, and very few people have provided multiple reviews. That worked when we were first getting started, but now I think we need something different. Since book reviews are about opinions, and the reader doesn't have to agree with the reviewer, I think readers need to know the reputation and leaning of the reviewer to make their own decision. For instance, my wife doesn't agree with movie-reviewer Roger Ebert, but based on his negative reviews she knows which movies she will like. At the same time, she knows that certain people on LiveJournal like the same sorts of movies she does, so she can trust them.
In line with all that, I think I'm going to move towards recurring book reviewers, and do more to establish their reputation. That also means that I want to get a couple of reviewers who think about books differently so I can give more readers someone that thinks like them. I've approached a couple of people, and we'll see how it turns out.
I've added two guides I hadn't run across previously:
It seems that every time I put out a new issue I have to leave town for a couple weeks. I don't think the two are related. Travel is just too busy this year. Add to that finishing up the Alpaca book and I've had a pretty busy week. I'll be back in the middle of December.
I'm sending out the email announcements as I post this. Once you get the new password, you'll be able to download the latest issue of The Perl Review.
In this issue:
Seven Sins of Perl OO Programming -- chromatic
Hash Anti-Patterns -- Alberto Manuel Simões
Haskell for Perlers -- Frank Antonsen
PerlWar -- Yanick Champoux
book reviews, commentary, news and more...
Some of you neglected to renew, but I won't bug you in email anymore. To find out more about the sampler on the cover, you'll have to renew that subscription.
If you haven't subscribed yet, now's the time because I have to raise prices next year when the US postal rates go up.
The next issue of The Perl Review comes out in a couple of weeks, so it's time to renew if you've already finished your first year subscription!
This may seem odd for some who just recently subscribed, but with the magic of time travel we let you pretend that your subscription started much earlier by allowing you to choose the issue your subscription starts with. You may have subscribed yesterday but selected to start with issue 1.0 (Winter 2004), meaning that your subscription ran out with our last issue, 1.3 (Fall 2005).
I've found that most new subscribers like to get most of the back issues at first, and it's worked out rather well for us. Now that we're starting our second year, though, I have to figure out how to make that work so people don't have to renew right away. I'll have to add a two-year subscription, I think.
I've been reading a lot about magazine renewal rates. A lot of publications seem to be happy to get 25% renewals, and they actually spend a fair bit of money to get that. Ever wonder why you get so many magazine renewal letters in your postal mail?
For TPR, since I don't have a lot of money, I simply used email. The target audience is certainly technology-adept, so that's not so bad. So far I've sent out three renewal emails. One was the week before OSCON, and I got about a 45% response for that. That's already good for magazines, but not quite the 98% renewal rate TPJ had in its first year. I sent out a second email two weeks after that, and another one over the weekend. I think I'm floating somewhere around 75% renewal right now.
I have a list of all of the people who haven't renewed (it's an SQL view ;), and a lot of them should have, and I would be surprised if they don't. Although email provided me with a virtually free way to get that 75%, I also think email has a tendency to get lost. First, if the subscriber can't deal with it right away, it joins the long queue of messages that get ignored forever. Second, it has a pretty good chance of being blocked by a spam filter. Third, some people might get so much mail that they just don't get to ever see it. I do send each mail individually since each contains a personal renewal code, so I'm not blasting out spam to any user at each domain.
I am considering sending out some postcards to the hold-outs. I figure that will cost me about $50, which means that I need about four renewals to make up for it. If I get four renewals, that's paid for. I'm also going to try emailing people directly. It certainly helps when other people talk about TPR. I noticed a big spike in renewals when Ovid talked about his upcoming Logic Programming article.
There's another trick to getting renewals: special promotional subscriptions. Magazines give you a special rate for a limited time then hit you up for the full price as quickly as possible. It's all part of the ad-selling game. Advertisers want to know how many "qualified subscribers" you have, which means how many people actually want your magazine enough to pay for it. Getting people to be full subscribers raises those numbers.
TPR is a bit different because I don't aim to make it an audited magazine, meaning that no one is going to come in and put a stamp on our books saying they verified our subscriber base. I'm not in it to sell advertising (let's see if I'm saying the same thing in three years). So far I've only taken advertising from people the Perl community already loves and trusts. Most of that is just filling up the other full color pages that come along with the cover.
I now have a big stack of emails confirming renewal transactions, and I need to shove those transactions into my local database so people keep getting their magazines.
Most of them are pretty easy because they are keyed on their unique ID in the database. A lot of people skipped the link I sent them and went to the subscription page directly, so I have to match up those with what I see in the database.
So far it works like this and handles 95% of the records:
Get all the matches by name
With one match, we're probably done
With no match, try different parts of the name (last, first)
With multiple matches, we need to match something else too.
From the name matches, compare email addresses. If they match, we're done.
At this point we probably have several candidate matches.
Look at parts of the address (country, city, address). If the first two are matches, we're pretty sure we have a match.
Look at the email if we might have a match and compare the user portion to the new email and the stored email. Do the same for the host portion. This is just to raise the confidence level a bit.
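The matching steps above could be sketched roughly like this. The record layout (hash references with name, email, country, and city keys), the function name match_renewal, and the sample subscribers are all invented for illustration. The final confidence check on the user and host parts of the email address is left out to keep the sketch short.

```perl
use strict;
use warnings;

# Hypothetical record layout: hash refs with id, name, email, country, city.
sub match_renewal {
    my ($renewal, $subscribers) = @_;

    # Get all the matches by full name (case-insensitive).
    my @candidates =
        grep { lc($_->{name}) eq lc($renewal->{name}) } @$subscribers;

    # With no match, retry on the last word of the name only.
    if (!@candidates) {
        my ($last) = $renewal->{name} =~ /(\S+)\s*$/;
        @candidates = grep { $_->{name} =~ /\b\Q$last\E\s*$/i } @$subscribers
            if defined $last;
    }

    return $candidates[0] if @candidates == 1;   # probably done
    return unless @candidates;                   # nothing to go on

    # With multiple matches, an identical email address settles it.
    my @by_email =
        grep { lc($_->{email}) eq lc($renewal->{email}) } @candidates;
    return $by_email[0] if @by_email == 1;

    # Otherwise require country and city to agree.
    my @by_place = grep {
        lc($_->{country}) eq lc($renewal->{country})
            && lc($_->{city}) eq lc($renewal->{city})
    } @candidates;
    return $by_place[0] if @by_place == 1;

    return;    # left for checking by hand
}

my @subscribers = (
    { id => 1, name => 'Ann Example', email => 'ann@example.com',
      country => 'US', city => 'Chicago' },
    { id => 2, name => 'Ann Example', email => 'ann@example.org',
      country => 'CA', city => 'Toronto' },
    { id => 3, name => 'Bob Sample', email => 'bob@example.com',
      country => 'US', city => 'Portland' },
);

my $renewal = { name => 'Ann Example', email => 'new@example.net',
                country => 'CA', city => 'Toronto' };
my $match = match_renewal($renewal, \@subscribers);
print defined $match ? "matched subscriber $match->{id}\n" : "check by hand\n";
```

Returning nothing for the ambiguous cases keeps them out of the automatic path, which matches the cautious approach of checking the remainder by hand.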
This leaves about 5% that I need to check by hand. It's the first time I've had to go through with this so I'm being very cautious. Thank god for test suites.
Now that we've made it through the first year, it's time to get people to renew. I've sent out renewal emails, but I'm guessing 10% of them will never make it into an inbox. Spam has just about ruined email for anything.
If you've already received four issues, it's time for you to renew, and you can use the subscription form and subscribe again.
Either way, I have an interesting task ahead. Since I don't store any personal information on the Pair.com servers that power our website, and we never store credit card information or do recurring billing, I get to match up the renewals with existing subscribers. It's easy to know which transactions are renewals, and each email I've sent has a link with a query string that I can link back to a subscriber. However, people might not follow the link but go directly to the website, or do all sorts of other things that don't let me see that code. We'll see how that goes.
In related matters, I was rewriting the code bits that parse the email I get so I can shove all that stuff into the database. I was doing fancy things with regexen, then Template::Extract, and some other things, and although it was a lot of fun, it was a big waste of time. Since I really just wanted to suck it into another program, why not send it as a ready-to-use data structure? It's easy enough to freeze things and thaw them later. I still see my nicely formatted template, but at the end I include the Perl-ready data. Besides the trivial parsing to isolate that, I'm ready to import things into the database. Things can be too simple to be seen sometimes.
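A minimal sketch of that idea, using the core Storable and MIME::Base64 modules. The marker lines, the function names, and the order fields are all made up for illustration; the post does not say which serializer or layout was actually used.

```perl
use strict;
use warnings;
use Storable qw(nfreeze thaw);
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical markers that fence off the machine-readable part of the mail.
my $BEGIN = '-----BEGIN DATA-----';
my $END   = '-----END DATA-----';

# Sender side: human-readable text first, then the frozen structure.
sub build_message {
    my ($order) = @_;
    my $payload = encode_base64(nfreeze($order));   # ends with a newline
    return "New subscription from $order->{name}\n\n$BEGIN\n$payload$END\n";
}

# Receiver side: isolate the payload and thaw it back into a hash.
# Only do this with mail from a source you control: thaw trusts its input.
sub parse_message {
    my ($message) = @_;
    my ($payload) = $message =~ /^\Q$BEGIN\E\n(.*?)^\Q$END\E$/ms
        or return;
    return thaw(decode_base64($payload));
}

my $order = { name => 'A. Reader', issues => 4, start => '2.0' };
my $copy  = parse_message(build_message($order));
print "$copy->{name}: $copy->{issues} issues from $copy->{start}\n";
```

Base64 keeps the frozen bytes safe through mail transport, and the readable part of the message stays untouched above the markers.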
I finally got to meet Eric Maki, TPR's designer, in person. We went through the current issue and looked at a lot of the design things we might want to change, and we also have an idea for an upcoming project. I'll have details later.
If you haven't seen the latest cover (Summer 2005), get yourself an issue. It's so nice that we're going to make posters of it.
After a long day at YAPC where I taught a 4-day Learning Perl course in a single day, and then a 5 hour boat dinner cruise, Josh McAdams of Perlcast interviewed me about The Perl Review. I'm not sure when he'll publish it.
The cool thing is that Josh is moving to Chicago. I might be able to get on Perlcast a bit more often. :)
I got an advance copy of The Best Software Writing from Apress. It's a collection of writings compiled by Joel Spolsky. Anyone want to take a whack at reviewing this for the next issue?
The Perl Review Summer 2005 issues should be in the mail today. The printer was screwing around again. They've been a disaster. You know the sort of people: they can't just do something without sending four emails back and forth, each with some new reason why things can't happen right away.
I'm thinking about making a special edition print issue with the best articles from the first two years. Good or bad idea? How much would you pay for that sort of thing? Which articles would you want?
The next issue is scheduled for December 1, and I'm looking for book reviews. Any recent book might work, but I'm specifically interested in reviews of:
Reviews should be 200 to 300 words and should reflect your opinion. Say explicitly who should buy this book and why they should, or that nobody should buy the book. No matter what you say, be nice.