Can you provide examples of parsing HTML?

How do you parse HTML with a variety of languages and parsing libraries?


When answering:

Individual comments will be linked to in answers to questions about how to parse HTML with regexes as a way of showing the right way to do things.

For the sake of consistency, I ask that each example parse an HTML file for the href in anchor tags. To make this question easy to search, I ask that you follow this format:

Language: [language name]

Library: [library name]

[example code]

Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:

Purpose: [what the parse does]


26
Language: JavaScript
Library: jQuery

$.each($('a[href]'), function(){
    console.debug(this.href);
});

(using firebug console.debug for output...)

And loading any html page:

$.get('http://stackoverflow.com/', function(page){
     $(page).find('a[href]').each(function(){
        console.debug(this.href);
    });
});

I used the other form of each for this one; I think it's cleaner when chaining methods.

23
Language: C#
Library: HtmlAgilityPack

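using System;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");

        // SelectNodes returns null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null) return;

        foreach (var node in nodes)
        {
            // Print the href attribute rather than the link's inner HTML.
            Console.WriteLine(node.GetAttributeValue("href", ""));
        }
    }
}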
19
language: Python
library: BeautifulSoup

from BeautifulSoup import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True) # find <a> with a defined href attribute
print links

output:

[<a href="http://foo.com">foo</a>,
 <a href="http://bar.com">bar</a>,
 <a href="http://baz.com">baz</a>]

also possible:

for link in links:
    print link['href']

output:

http://foo.com
http://bar.com
http://baz.com
17
Language: Perl
Library: pQuery

use strict;
use warnings;
use pQuery;

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

pQuery( $html )->find( 'a' )->each(
    sub {  
        my $at = $_->getAttribute( 'href' ); 
        print "$at\n" if defined $at;
    }
);

/I3az/

13
language: Ruby
library: Hpricot

#!/usr/bin/ruby

require 'hpricot'

html = '<html><body>'
['foo', 'bar', 'baz'].each {|link| html += "<a href=\"http://#{link}.com\">#{link}</a>" }
html += '</body></html>'

doc = Hpricot(html)
doc.search('//a').each {|elm| puts elm.attributes['href'] }
12
language: shell
library: lynx (well, it's not a library, but in the shell every program is kind of a library)

lynx -dump -listonly http://news.google.com/
10
language: Perl
library: HTML::Parser

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

my $find_links = HTML::Parser->new(
    start_h => [
    	sub {
    		my ($tag, $attr) = @_;
    		if ($tag eq 'a' and exists $attr->{href}) {
    			print "$attr->{href}\n";
    		}
    	}, 
    	"tag, attr"
    ]
);

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

$find_links->parse($html);
9
language: Python
library: HTMLParser

#!/usr/bin/python

from HTMLParser import HTMLParser

class FindLinks(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print at['href']


find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)
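
The code above targets Python 2. In Python 3 the module lives at html.parser, and print is a function. Below is a minimal sketch of the same idea for Python 3; the names LinkCollector and extract_links are made up for this illustration, and the links are collected into a list rather than printed from inside the handler.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag it is fed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased, so <A HREF=...> is handled too.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html):
    """Hypothetical helper: parse a string and return the hrefs found."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


html = "<html><body>"
for name in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (name, name)
html += "</body></html>"

for href in extract_links(html):
    print(href)
```

Returning a list keeps the parser reusable: the caller decides whether to print, filter, or store the links.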
9
Language: Perl
Library: HTML::LinkExtor

The beauty of Perl is that it has modules for very specific tasks, such as link extraction.

Whole program:

#!/usr/bin/perl -w
use strict;

use HTML::LinkExtor;
use LWP::Simple;

my $url     = 'http://www.google.com/';
my $content = get( $url );

my $p       = HTML::LinkExtor->new( \&process_link, $url, );
$p->parse( $content );

exit;

sub process_link {
    my ( $tag, %attr ) = @_;

    return unless $tag eq 'a';
    return unless defined $attr{ 'href' };

    print "- $attr{'href'}\n";
    return;
}

Explanation:

  • use strict - turns on "strict" mode, which eases debugging; not essential to the example
  • use HTML::LinkExtor - loads the module that does the link extraction
  • use LWP::Simple - a simple way to fetch some HTML for testing
  • my $url = 'http://www.google.com/' - the page we will extract URLs from
  • my $content = get( $url ) - fetches the page's HTML
  • my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates a LinkExtor object, giving it a reference to the function that will be called back for every link element found, and $url to use as the base URL for resolving relative URLs
  • $p->parse( $content ) - parses the HTML, invoking the callback along the way
  • exit - ends the program
  • sub process_link - begins the definition of the process_link function
  • my ( $tag, %attr ) = @_ - gets the arguments: the tag name and its attributes
  • return unless $tag eq 'a' - skips processing if the tag is not <a>
  • return unless defined $attr{'href'} - skips processing if the <a> tag has no href attribute
  • print "- $attr{'href'}\n"; - prints the link
  • return; - finishes the function

That's all.
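
The base-URL step is the part that is easy to miss, so here is a rough Python 3 sketch of the same idea using only the standard library. It is not HTML::LinkExtor; the class AbsoluteLinkCollector, the helper absolute_links, and the sample URLs are all invented for illustration. Each href is resolved against a base URL with urllib.parse.urljoin.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AbsoluteLinkCollector(HTMLParser):
    """Collects <a href> values, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return
        # urljoin leaves absolute URLs alone and resolves relative ones.
        self.links.append(urljoin(self.base_url, href))


def absolute_links(html, base_url):
    """Hypothetical helper: return absolute link targets found in html."""
    collector = AbsoluteLinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up sample markup and base URL, for illustration only.
sample = (
    '<a href="/about">About</a> '
    '<a href="faq.html">FAQ</a> '
    '<a href="http://example.org/">Other</a>'
)
for link in absolute_links(sample, "http://www.example.com/docs/index.html"):
    print("- " + link)
```

With this base URL, "/about" resolves against the host root and "faq.html" against the /docs/ directory, while the already absolute link is passed through unchanged.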

8
Language: Ruby
Library: Nokogiri

#!/usr/bin/env ruby
require 'nokogiri'
require 'open-uri'

document = Nokogiri::HTML(open("http://google.com"))
document.css("html head title").first.content
=> "Google"
document.xpath("//title").first.content
=> "Google"
8
Language: Common Lisp
Library: Closure Html, Closure Xml, CL-WHO

(shown using DOM API, without using XPATH or STP API)

(defvar *html*
  (who:with-html-output-to-string (stream)
    (:html
     (:body (loop
               for site in (list "foo" "bar" "baz")
               do (who:htm (:a :href (format nil "http://~A.com/" site))))))))

(defvar *dom*
  (chtml:parse *html* (cxml-dom:make-dom-builder)))

(loop
   for tag across (dom:get-elements-by-tag-name *dom* "a")
   collect (dom:get-attribute tag "href"))
=> 
("http://foo.com/" "http://bar.com/" "http://baz.com/")
6
Language: Clojure
Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)


Selector expression:

(def test-select
     (html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))

Now we can do the following at the REPL (I've added line breaks in test-select):

user> test-select
({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
 {:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
 {:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
user> (map #(get-in % [:attrs :href]) test-select)
("http://foo.com/" "http://bar.com/" "http://baz.com/")

You'll need the following to try it out:

Preamble:

(require '[net.cgrand.enlive-html :as html])

Test HTML:

(def test-html
     (apply str (concat ["<html><body>"]
                        (for [link ["foo" "bar" "baz"]]
                          (str "<a href=\"http://" link ".com/\">" link "</a>"))
                        ["</body></html>"])))
5
language: Perl
library: XML::Twig

#!/usr/bin/perl
use strict;
use warnings;
use Encode ':all';

use LWP::Simple;
use XML::Twig;

#my $url = 'http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
my $url = 'http://www.google.com';
my $content = get($url);
die "Couldn't fetch!" unless defined $content;

my $twig = XML::Twig->new();
$twig->parse_html($content);

my @hrefs = map {
    $_->att('href');
} $twig->get_xpath('//*[@href]');

print "$_\n" for @hrefs;

Caveat: this can produce wide-character errors on pages like this one (switching the URL to the commented-out one triggers the error). The HTML::Parser solution above doesn't share this problem.

5
Language: Java
Libraries: XOM, TagSoup

I've included intentionally malformed and inconsistent XML in this sample.

import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.Node;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final Parser parser = new Parser();
        parser.setFeature(Parser.namespacesFeature, false);
        final Builder builder = new Builder(parser);
        final Document document = builder.build("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>", null);
        final Element root = document.getRootElement();
        final Nodes links = root.query("//a[@href]");
        for (int linkNumber = 0; linkNumber < links.size(); ++linkNumber) {
            final Node node = links.get(linkNumber);
            System.out.println(((Element) node).getAttributeValue("href"));
        }
    }
}

TagSoup adds an XML namespace referencing XHTML to the document by default. I've chosen to suppress that in this sample. Using the default behavior would require the call to root.query to include a namespace like so:

root.query("//xhtml:a[@href]", new nu.xom.XPathContext("xhtml", root.getNamespaceURI()));
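
The namespace effect is not specific to XOM. As a rough illustration in Python 3, the sketch below uses the standard library's xml.etree.ElementTree on a small, well-formed XHTML string (ElementTree is a strict XML parser, not a tag-soup parser, and the sample markup is made up). An unqualified path finds nothing once the document declares the XHTML namespace, while a prefixed path with a prefix mapping finds the link.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Well-formed sample document that declares XHTML as its default namespace.
document = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<a href="http://google.com">google</a>'
    '<a name="nothing">nothing</a>'
    "</body></html>"
)

root = ET.fromstring(document)

# Without a prefix the path looks for elements in no namespace: no match.
unqualified = root.findall(".//a[@href]")

# With a prefix bound to the XHTML namespace, the anchor is found.
qualified = root.findall(".//xhtml:a[@href]", {"xhtml": XHTML_NS})
hrefs = [a.get("href") for a in qualified]

print(len(unqualified), hrefs)
```

The prefix name itself is arbitrary; only the namespace URI it is bound to has to match the document.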
4
Language: JavaScript
Library: DOM

var links = document.links;
// document.links only contains elements that have an href,
// so an indexed loop over the collection is enough.
for (var i = 0; i < links.length; i++) {
    console.debug(links[i].href);
}

(using firebug console.debug for output...)

3
Language: C#
Library: System.XML (standard .NET)

using System.Collections.Generic;
using System.Xml;

public static void Main(string[] args)
{
    List<string> matches = new List<string>();

    XmlDocument xd = new XmlDocument();
    xd.LoadXml("<html>...</html>");

    FindHrefs(xd.FirstChild, matches);
}

static void FindHrefs(XmlNode xn, List<string> matches)
{
    if (xn.Attributes != null && xn.Attributes["href"] != null)
        matches.Add(xn.Attributes["href"].InnerXml);

    foreach (XmlNode child in xn.ChildNodes)
        FindHrefs(child, matches);
}
3
Language: PHP
Library: SimpleXML (and DOM)

<?php
$page = new DOMDocument();
$page->strictErrorChecking = false;
$page->loadHTMLFile('http://stackoverflow.com/questions/773340');
$xml = simplexml_import_dom($page);

$links = $xml->xpath('//a[@href]');
foreach($links as $link)
    echo $link['href']."\n";
3
Language: Objective-C
Library: libxml2, Matt Gallagher's libxml2 wrappers, Ben Copsey's ASIHTTPRequest

ASIHTTPRequest *request = [[ASIHTTPRequest alloc] initWithURL:[NSURL URLWithString:@"http://stackoverflow.com/questions/773340"]];
[request start];
NSError *error = [request error];
if (!error) {
    NSData *response = [request responseData];
    NSLog(@"Data: %@", [[self query:@"//a[@href]" withResponse:response] description]);
    [request release];
}
else 
    @throw [NSException exceptionWithName:@"kMyHTTPRequestFailed" reason:@"Request failed!" userInfo:nil];

...

- (id) query:(NSString *)xpathQuery withResponse:(NSData *)resp {
    NSArray *nodes = PerformHTMLXPathQuery(resp, xpathQuery);
    if (nodes != nil)
        return nodes;
    return nil;
}
3
Language: Perl
Library: HTML::TreeBuilder

use strict;
use HTML::TreeBuilder;
use LWP::Simple;

my $content = get 'http://www.stackoverflow.com';
my $document = HTML::TreeBuilder->new->parse($content)->eof;

for my $a ($document->find('a')) {
    print $a->attr('href'), "\n" if $a->attr('href');
}
2
language: Python
library: lxml.html

import lxml.html

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

tree = lxml.html.document_fromstring(html)
for element, attribute, link, pos in tree.iterlinks():
    if attribute == "href":
        print link

lxml also has a CSS selector class for traversing the DOM, which can make using it very similar to using jQuery:

for a in tree.cssselect('a[href]'):
    print a.get('href')
2
Language: Racket

Library: (planet ashinn/html-parser:1) and (planet clements/sxml2:1)

(require net/url
         (planet ashinn/html-parser:1)
         (planet clements/sxml2:1))

(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->sxml))
(define links ((sxpath "//a/@href/text()") doc))
1
Language: Python
Library: HTQL

import htql; 

page="<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>";
query="<a>:href,tx";

for url, text in htql.HTQL(page, query): 
    print url, text;

Simple and intuitive.

1
language: Ruby
library: Nokogiri

#!/usr/bin/env ruby

require "nokogiri"
require "open-uri"

doc = Nokogiri::HTML(open('http://www.example.com'))
hrefs = doc.search('a').map{ |n| n['href'] }

puts hrefs

Which outputs:

/
/domains/
/numbers/
/protocols/
/about/
/go/rfc2606
/about/
/about/presentations/
/about/performance/
/reports/
/domains/
/domains/root/
/domains/int/
/domains/arpa/
/domains/idn-tables/
/protocols/
/numbers/
/abuse/
http://www.icann.org/
mailto:iana@iana.org?subject=General website feedback

This is a minor spin on the one above, resulting in an output that is usable for a report. I only return the first and last elements in the list of hrefs:

#!/usr/bin/env ruby

require "nokogiri"
require "open-uri"

doc = Nokogiri::HTML(open('http://nokogiri.org'))
hrefs = doc.search('a[href]').map{ |n| n['href'] }

puts hrefs
  .each_with_index                     # add an array index
  .minmax{ |a,b| a.last <=> b.last }   # find the first and last element
  .map{ |h,i| '%3d %s' % [1 + i, h ] } # format the output

  1 http://github.com/tenderlove/nokogiri
100 http://yokolet.blogspot.com
1
Language: Java
Library: jsoup

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HtmlTest {
    public static void main(final String[] args) {
        final Document document = Jsoup.parse("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>");
        final Elements links = document.select("a[href]");
        for (final Element element : links) {
            System.out.println(element.attr("href"));
        }
    }
}
0
Language: PHP
Library: DOM

<?php
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTMLFile('http://stackoverflow.com/questions/773340');
$xpath = new DOMXpath($doc);

$links = $xpath->query('//a[@href]');
for ($i = 0; $i < $links->length; $i++)
    echo $links->item($i)->getAttribute('href'), "\n";

Sometimes it's useful to put the @ symbol before $doc->loadHTMLFile to suppress warnings about invalid HTML.

0
Language: JavaScript
Library: PhantomJS

Save this file as extract-links.js:

var page = new WebPage(),
    url = 'http://www.udacity.com';

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var results = page.evaluate(function() {
            var list = document.querySelectorAll('a'), links = [], i;
            for (i = 0; i < list.length; i++) {
                links.push(list[i].href);
            }
            return links;
        });
        console.log(results.join('\n'));
    }
    phantom.exit();
});

run:

$ ../path/to/bin/phantomjs extract-links.js

