<html><head><meta name="color-scheme" content="light dark"></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">package HTML::HeadParser;

=head1 NAME

HTML::HeadParser - Parse &lt;HEAD&gt; section of a HTML document

=head1 SYNOPSIS

 require HTML::HeadParser;
 $p = HTML::HeadParser-&gt;new;
 $p-&gt;parse($text) and  print "not finished";

 $p-&gt;header('Title')          # to access &lt;title&gt;....&lt;/title&gt;
 $p-&gt;header('Content-Base')   # to access &lt;base href="http://..."&gt;
 $p-&gt;header('Foo')            # to access &lt;meta http-equiv="Foo" content="..."&gt;
 $p-&gt;header('X-Meta-Author')  # to access &lt;meta name="author" content="..."&gt;
 $p-&gt;header('X-Meta-Charset') # to access &lt;meta charset="..."&gt;

=head1 DESCRIPTION

The C&lt;HTML::HeadParser&gt; is a specialized (and lightweight)
C&lt;HTML::Parser&gt; that will only parse the E&lt;lt&gt;HEAD&gt;...E&lt;lt&gt;/HEAD&gt;
section of an HTML document.  The parse() method
will return a FALSE value as soon as some E&lt;lt&gt;BODY&gt; element or body
text are found, and should not be called again after this.

Note that the C&lt;HTML::HeadParser&gt; might get confused if raw undecoded
UTF-8 is passed to the parse() method.  Make sure the strings are
properly decoded before passing them on.

The C&lt;HTML::HeadParser&gt; keeps a reference to a header object, and the
parser will update this header object as the various elements of the
E&lt;lt&gt;HEAD&gt; section of the HTML document are recognized.  The following
header fields are affected:

=over 4

=item Content-Base:

The I&lt;Content-Base&gt; header is initialized from the E&lt;lt&gt;base
href="..."&gt; element.

=item Title:

The I&lt;Title&gt; header is initialized from the E&lt;lt&gt;title&gt;...E&lt;lt&gt;/title&gt;
element.

=item Isindex:

The I&lt;Isindex&gt; header will be added if there is a E&lt;lt&gt;isindex&gt;
element in the E&lt;lt&gt;head&gt;.  The header value is initialized from the
I&lt;prompt&gt; attribute if it is present.  If no I&lt;prompt&gt; attribute is
given it will have '?' as the value.

=item X-Meta-Foo:

All E&lt;lt&gt;meta&gt; elements containing a C&lt;name&gt; attribute will result in
headers using the prefix C&lt;X-Meta-&gt; appended with the value of the
C&lt;name&gt; attribute as the name of the header, and the value of the
C&lt;content&gt; attribute as the pushed header value.

E&lt;lt&gt;meta&gt; elements containing a C&lt;http-equiv&gt; attribute will result
in headers as in above, but without the C&lt;X-Meta-&gt; prefix in the
header name.

E&lt;lt&gt;meta&gt; elements containing a C&lt;charset&gt; attribute will result in
an C&lt;X-Meta-Charset&gt; header, using the value of the C&lt;charset&gt;
attribute as the pushed header value.

The ':' character can't be represented in header field names, so
if the meta element contains this char it's substituted with '-'
before forming the field name.

=back

=head1 METHODS

The following methods (in addition to those provided by the
superclass) are available:

=over 4

=cut


require HTML::Parser;
our @ISA = qw(HTML::Parser);

use HTML::Entities ();

use strict;
our $DEBUG;
#$DEBUG = 1;
our $VERSION = '3.81';

=item $hp = HTML::HeadParser-&gt;new

=item $hp = HTML::HeadParser-&gt;new( $header )

The object constructor.  The optional $header argument should be a
reference to an object that implement the header() and push_header()
methods as defined by the C&lt;HTTP::Headers&gt; class.  Normally it will be
of some class that is a or delegates to the C&lt;HTTP::Headers&gt; class.

If no $header is given C&lt;HTML::HeadParser&gt; will create an
C&lt;HTTP::Headers&gt; object by itself (initially empty).

=cut

sub new
{
    my($class, $header) = @_;
    unless ($header) {
	require HTTP::Headers;
	$header = HTTP::Headers-&gt;new;
    }

    my $self = $class-&gt;SUPER::new(api_version =&gt; 3,
				  start_h =&gt; ["start", "self,tagname,attr"],
				  end_h   =&gt; ["end",   "self,tagname"],
				  text_h  =&gt; ["text",  "self,text"],
				  ignore_elements =&gt; [qw(script style)],
				 );
    $self-&gt;{'header'} = $header;
    $self-&gt;{'tag'} = '';   # name of active element that takes textual content
    $self-&gt;{'text'} = '';  # the accumulated text associated with the element
    $self;
}

=item $hp-&gt;header;

Returns a reference to the header object.

=item $hp-&gt;header( $key )

Returns a header value.  It is just a shorter way to write
C&lt;$hp-E&lt;gt&gt;header-E&lt;gt&gt;header($key)&gt;.

=cut

sub header
{
    my $self = shift;
    return $self-&gt;{'header'} unless @_;
    $self-&gt;{'header'}-&gt;header(@_);
}

sub as_string    # legacy
{
    my $self = shift;
    $self-&gt;{'header'}-&gt;as_string;
}

sub flush_text   # internal
{
    my $self = shift;
    my $tag  = $self-&gt;{'tag'};
    my $text = $self-&gt;{'text'};
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    $text =~ s/\s+/ /g;
    print "FLUSH $tag =&gt; '$text'\n"  if $DEBUG;
    if ($tag eq 'title') {
	my $decoded;
	$decoded = utf8::decode($text) if $self-&gt;utf8_mode &amp;&amp; defined &amp;utf8::decode;
	HTML::Entities::decode($text);
	utf8::encode($text) if $decoded;
	$self-&gt;{'header'}-&gt;push_header(Title =&gt; $text);
    }
    $self-&gt;{'tag'} = $self-&gt;{'text'} = '';
}

# This is an quote from the HTML3.2 DTD which shows which elements
# that might be present in a &lt;HEAD&gt;...&lt;/HEAD&gt;.  Also note that the
# &lt;HEAD&gt; tags themselves might be missing:
#
# &lt;!ENTITY % head.content "TITLE &amp; ISINDEX? &amp; BASE? &amp; STYLE? &amp;
#                            SCRIPT* &amp; META* &amp; LINK*"&gt;
#
# &lt;!ELEMENT HEAD O O  (%head.content)&gt;
#
# From HTML 4.01:
#
# &lt;!ENTITY % head.misc "SCRIPT|STYLE|META|LINK|OBJECT"&gt;
# &lt;!ENTITY % head.content "TITLE &amp; BASE?"&gt;
# &lt;!ELEMENT HEAD O O (%head.content;) +(%head.misc;)&gt;
#
# From HTML 5 as of WD-html5-20090825:
#
# One or more elements of metadata content, [...]
# =&gt; base, command, link, meta, noscript, script, style, title

sub start
{
    my($self, $tag, $attr) = @_;  # $attr is reference to a HASH
    print "START[$tag]\n" if $DEBUG;
    $self-&gt;flush_text if $self-&gt;{'tag'};
    if ($tag eq 'meta') {
	my $key = $attr-&gt;{'http-equiv'};
	if (!defined($key) || !length($key)) {
	    if ($attr-&gt;{name}) {
		$key = "X-Meta-\u$attr-&gt;{name}";
	    } elsif ($attr-&gt;{charset}) { # HTML 5 &lt;meta charset="..."&gt;
		$key = "X-Meta-Charset";
		$self-&gt;{header}-&gt;push_header($key =&gt; $attr-&gt;{charset});
		return;
	    } else {
		return;
	    }
	}
	$key =~ s/:/-/g;
	$self-&gt;{'header'}-&gt;push_header($key =&gt; $attr-&gt;{content});
    } elsif ($tag eq 'base') {
	return unless exists $attr-&gt;{href};
	(my $base = $attr-&gt;{href}) =~ s/^\s+//; $base =~ s/\s+$//; # HTML5
	$self-&gt;{'header'}-&gt;push_header('Content-Base' =&gt; $base);
    } elsif ($tag eq 'isindex') {
	# This is a non-standard header.  Perhaps we should just ignore
	# this element
	$self-&gt;{'header'}-&gt;push_header(Isindex =&gt; $attr-&gt;{prompt} || '?');
    } elsif ($tag =~ /^(?:title|noscript|object|command)$/) {
	# Just remember tag.  Initialize header when we see the end tag.
	$self-&gt;{'tag'} = $tag;
    } elsif ($tag eq 'link') {
	return unless exists $attr-&gt;{href};
	# &lt;link href="http:..." rel="xxx" rev="xxx" title="xxx"&gt;
	my $href = delete($attr-&gt;{href});
	$href =~ s/^\s+//; $href =~ s/\s+$//; # HTML5
	my $h_val = "&lt;$href&gt;";
	for (sort keys %{$attr}) {
	    next if $_ eq "/";  # XHTML junk
	    $h_val .= qq(; $_="$attr-&gt;{$_}");
	}
	$self-&gt;{'header'}-&gt;push_header(Link =&gt; $h_val);
    } elsif ($tag eq 'head' || $tag eq 'html') {
	# ignore
    } else {
	 # stop parsing
	$self-&gt;eof;
    }
}

sub end
{
    my($self, $tag) = @_;
    print "END[$tag]\n" if $DEBUG;
    $self-&gt;flush_text if $self-&gt;{'tag'};
    $self-&gt;eof if $tag eq 'head';
}

sub text
{
    my($self, $text) = @_;
    print "TEXT[$text]\n" if $DEBUG;
    unless ($self-&gt;{first_chunk}) {
	# drop Unicode BOM if found
	if ($self-&gt;utf8_mode) {
	    $text =~ s/^\xEF\xBB\xBF//;
	}
	else {
	    $text =~ s/^\x{FEFF}//;
	}
	$self-&gt;{first_chunk}++;
    }
    my $tag = $self-&gt;{tag};
    if (!$tag &amp;&amp; $text =~ /\S/) {
	# Normal text means start of body
        $self-&gt;eof;
	return;
    }
    return if $tag ne 'title';
    $self-&gt;{'text'} .= $text;
}

BEGIN {
    *utf8_mode = sub { 1 } unless HTML::Entities::UNICODE_SUPPORT;
}

1;

__END__

=back

=head1 EXAMPLE

 $h = HTTP::Headers-&gt;new;
 $p = HTML::HeadParser-&gt;new($h);
 $p-&gt;parse(&lt;&lt;EOT);
 &lt;title&gt;Stupid example&lt;/title&gt;
 &lt;base href="http://www.linpro.no/lwp/"&gt;
 Normal text starts here.
 EOT
 undef $p;
 print $h-&gt;title;   # should print "Stupid example"

=head1 SEE ALSO

L&lt;HTML::Parser&gt;, L&lt;HTTP::Headers&gt;

The C&lt;HTTP::Headers&gt; class is distributed as part of the
I&lt;libwww-perl&gt; package.  If you don't have that distribution installed
you need to provide the $header argument to the C&lt;HTML::HeadParser&gt;
constructor with your own object that implements the documented
protocol.

=head1 COPYRIGHT

Copyright 1996-2001 Gisle Aas. All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=cut
</pre></body></html>