Parsing HTML with XML::LibXML - Whitebell::HatenaBlog

Parsing HTML with XML::LibXML should be as simple as this:

use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();
# Keep going on malformed HTML and suppress the parser's error output.
$parser->recover_silently(1);
my $doc = $parser->parse_html_file('http://blog.livedoor.jp/dankogai/');
print $doc->toString;

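That ought to be enough*1, but for some files the parser complains with "input conversion failed due to input error". I looked into it and, just as the message says, it is tripping over characters that cannot be converted during character-encoding conversion. When parsing XML, miyagawa's XML::Liberal apparently takes care of this for you*2. This time, though, I want to parse HTML rather than XML, and from reading XML::Liberal::Remedy::InvalidEncoding it does not seem to go as far as handling the HTML case. There is no way around it, so I will deal with it myself, using ゆーすけべー日記 as a reference. There are two things to do: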

Convert the HTML text from euc-jp to utf-8 with Encode::from_to
Use a regular expression to replace <meta http-equiv="Content-Type" content="text/html; charset=euc-jp"> with <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

use strict;
use warnings;
use Encode qw/from_to/;
use Perl6::Slurp;
use XML::LibXML;

my $parser = XML::LibXML->new;
$parser->recover_silently(1);

# Read the whole file into one string.
my $str = slurp 'f1.html';
# Convert the bytes in place from euc-jp to utf-8.
from_to $str, 'euc-jp', 'utf-8';
# Make the charset declaration match the converted text.
$str =~ s{<meta http-equiv="Content-Type" content="text/html; charset=euc-jp">}
         {<meta http-equiv="Content-Type" content="text/html; charset=utf-8">};
my $tree = $parser->parse_html_string($str);
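
The code above hard-codes euc-jp and only matches one exact spelling of the meta tag. As a rough sketch of how the same two steps could be made a little more general, the helper below picks up the charset from the first charset=... declaration it finds and rewrites it regardless of case or quoting. The name html_to_utf8 and its $fallback argument are made up for this illustration and are not part of Encode or XML::LibXML. The charset detection is a simple heuristic, and an encoding name that Encode does not know will make decode die.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the original post): turn raw HTML bytes
# into UTF-8 bytes and point the charset declaration at utf-8.
sub html_to_utf8 {
    my ($bytes, $fallback) = @_;

    # Heuristic: use the first charset=... declaration in the text,
    # otherwise fall back to the encoding supplied by the caller.
    my ($charset) = $bytes =~ /charset\s*=\s*["']?([\w.:-]+)/i;
    $charset = $fallback unless defined $charset;

    # With the default CHECK argument, decode() substitutes U+FFFD for
    # malformed input instead of dying; then re-encode as UTF-8 bytes.
    my $utf8 = encode('UTF-8', decode($charset, $bytes));

    # Rewrite the declared charset, ignoring case and quoting style.
    $utf8 =~ s/(charset\s*=\s*["']?)\Q$charset\E/${1}utf-8/i;

    return $utf8;
}
```

The returned string can be passed to $parser->parse_html_string in place of $str, for example html_to_utf8(slurp('f1.html'), 'euc-jp').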

*1: XML::LibXMLでHTML文書を扱う (Handling HTML documents with XML::LibXML) - 徒書

*2:XML::Liberal and gaiji - Bulknews::Subtech - subtech