Working With HTML In Haskell

updated: April 27, 2012

This is a complete guide to using HXT for parsing and processing HTML in Haskell.

What is HXT?

HXT is a collection of tools for processing XML with Haskell. It's a complex beast, but HXT is powerful and flexible, and very elegant once you know how to use it.

Why HXT?

Here's how HXT stacks up against some other XML parsers:

HXT vs TagSoup

TagSoup is the crowd favorite for HTML scraping in Haskell, but it's a bit too basic for my needs.

HXT vs HaXml

HXT is based on the ideas of HaXml. The two are very similar, but I think HXT is a little more elegant.

HXT vs hexpat

hexpat is a high-performance XML parser. It might be more appropriate depending on your use case. hexpat lacks a collection of tools for processing the HTML, but you can try Parsec for that bit.

HXT vs xml (Text.XML.Light)

I haven't used Text.XML.Light. If you have used it and liked it, please let me know!

The one thing all these packages have in common is poor documentation.

Einstein after spending several hours trying to find documentation on XML parsers

Hello World

To whet your appetite, here's a simple script that uses HXT to get all links on a page:

import Text.XML.HXT.Core

main = do
  html <- readFile "test.html"
  let doc = readString [withParseHTML yes, withWarnings no] html
  links <- runX $ doc //> hasName "a" >>> getAttrValue "href"
  mapM_ putStrLn links

Understanding Arrows

I don't assume any prior knowledge of Arrows. In fact, one of the goals of this guide is to help you understand Arrows a little better.

The Least You Need to Know About Arrows

Arrows are a way of representing computations that take an input and return an output. An Arrow type has two type parameters, the input type a and the output type b, so every Arrow type looks like SomeType a b:

-- an Arrow that takes an `a` and returns a `b`:
arrow1 :: SomeType a b

-- an Arrow that takes a `b` and returns a `c`:
arrow2 :: SomeType b c

-- an Arrow that takes a `String` and returns an `Int`:
arrow3 :: SomeType String Int

Arrows sound like functions! In fact, functions are arrows.

-- a function that takes an Int and returns a Bool
odd :: Int -> Bool

-- also, an Arrow that takes an Int and returns a Bool
odd :: (->) Int Bool

Don't get confused by the two different type signatures! Int -> Bool is just the infix way of writing (->) Int Bool.

Arrow Composition

You'll be using >>> a lot with HXT, so it's a good idea to understand how it works.

>>> composes two arrows into a new arrow.

We could compose length and odd like so: odd . length.

Since functions are Arrows, we could also compose them like so: length >>> odd or odd <<< length.

They're all exactly the same!

ghci> odd . length $ [1, 2, 3]
True
ghci> length >>> odd $ [1, 2, 3]
True
ghci> odd <<< length $ [1, 2, 3]
True

A function is the most basic type of arrow, but there are many other types. HXT defines its own Arrows, and we will be working with them a lot.
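
To make "other types of arrows" a bit more concrete, here's a small sketch using Kleisli from Control.Arrow (it ships with base and has nothing to do with HXT). The names safeRecip and pipeline are made up for this example. The point is only that >>> and arr work the same way on an arrow that isn't a plain function:

```haskell
import Control.Arrow (Kleisli (..), arr, (>>>))

-- A Kleisli arrow wraps a function whose result lives in a monad (here Maybe).
-- safeRecip is a hypothetical helper: it fails on zero instead of dividing.
safeRecip :: Kleisli Maybe Double Double
safeRecip = Kleisli $ \x -> if x == 0 then Nothing else Just (1 / x)

-- arr lifts a plain function into the arrow type so it can be composed with >>>.
pipeline :: Kleisli Maybe Double Double
pipeline = safeRecip >>> arr (* 2)

main :: IO ()
main = do
  print (runKleisli pipeline 4) -- Just 0.5
  print (runKleisli pipeline 0) -- Nothing
```

HXT's arrows follow the same pattern: a wrapper type around a function, plus a "run" function (runX, and later runLA) to get the results out.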

Let's get started. Don't worry if Arrows still seem unclear. We will be writing a lot of examples, so they should become clear soon enough.

Getting Started

Step 1: Install HXT:

cabal install hxt

Step 2: Install HandsomeSoup:

cabal install HandsomeSoup

HandsomeSoup contains a powerful css function that will allow us to access elements using css selectors. We will use this function until we can write a basic version of it ourselves as explained here. For more info about HandsomeSoup, see this section.

Step 3: Here's the HTML we'll be working with:

<html><head><title>The Dormouse's story</title></head>
  <body>
<p class='title'><b>The Dormouse's story</b></p>

<p class='story'>Once upon a time there were three little sisters; and their names were
<a href='http://example.com/elsie' class='sister' id='link1'>Elsie</a>,
<a href='http://example.com/lacie' class='sister' id='link2'>Lacie</a> and
<a href='http://example.com/tillie' class='sister' id='link3'>Tillie</a>;
and they lived at the bottom of a well.</p>

<p class='story'>Some text</p>
</body>
</html>

Save it as test.html.

Step 4: Import HXT, HandsomeSoup, and the html file into ghci:

import Text.XML.HXT.Core
import Text.HandsomeSoup
html <- readFile "test.html"

Parse a String as HTML

Use readString:

ghci> let doc = readString [withParseHTML yes, withWarnings no] html

doc is now a parsed HTML document, ready to be processed!

Now we can do things like getting all links in the document:

ghci> doc >>> css "a"

Arrow Interlude #1: HXT Arrows

You just used your first Arrow! css is an Arrow. Here's its type:

ghci> :t css
css :: ArrowXml a => String -> a XmlTree XmlTree

So css takes a selector String and gives us an Arrow that takes an XmlTree and returns another XmlTree. A lot of Arrows in HXT have this type: they all transform the current tree and return a new tree.

Extracting Content

doc is wrapped in an IOStateArrow. If you try to see the contents of doc, you'll get an error:

<interactive>:1:1:
    No instance for (Show (IOSLA (XIOState s0) a0 XmlTree))
      arising from a use of `print'
    Possible fix:
      add an instance declaration for
      (Show (IOSLA (XIOState s0) a0 XmlTree))
    In a stmt of an interactive GHCi command: print it

Use runX to extract the contents.

contents <- runX doc
print contents

Prints out:

[NTree (XTag "/" [NTree (XAttr "transfer-Status")...]

Pretty-printing

I don't want to see ugly Haskell types. Let's use xshow to convert our tree to HTML:

res <- runX . xshow $ doc
mapM_ putStrLn res

Prints out:

</ transfer-Status="200" transfer-Message="OK" transfer-URI="string:" source=""<html><head><title>The Dormouse's story</titl..."" transfer-Encoding="UNICODE"><html>
<head>
<title>The Dormouse's story</title>
...

Much better! Now use indentDoc to add proper indentation:

res <- runX . xshow $ doc >>> indentDoc
mapM_ putStrLn res

Prints out:

</ transfer-Status="200" transfer-Message="OK" transfer-URI="string:" source=""<html><head><title>The Dormouse's story</titl..."" transfer-Encoding="UNICODE"><html>
  <head>
    <title>The Dormouse's story</title>
...

Perfect.

Selecting Elements

Note: To keep things simple for now, these examples make use of the css Arrow from HandsomeSoup.

Get all a tags

-- all a tags
doc >>> css "a"
-- only the a tags that have an id attribute
doc >>> css "a" >>> hasAttr "id"

Get all values for an attribute

doc >>> css "a" >>> getAttrValue "href"

See how easy it is to chain transformations together using >>>? Notice how using getAttrValue gets us the links for all a tags, instead of just one:

ghci> runX $ doc >>> css "a" >>> getAttrValue "href"
["http://example.com/elsie","http://example.com/lacie","http://example.com/tillie"]

This is a core idea behind HXT. In HXT, everything you do is a series of transformations on the whole tree. So you can use getAttrValue and HXT will automatically apply it to all the elements.

-- only the a tag whose id is "link1"
doc >>> css "a" >>> hasAttrValue "id" (== "link1")

Get multiple values at once

Use <+>:

-- get all p tags as well as all a tags
doc >>> css "p" <+> css "a"
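
If it helps to see <+> without any HTML involved, here's a tiny sketch on plain numbers. It uses runLA, which runs a pure HXT list arrow (type LA) on a single input and returns the list of results. Both branches get the same input, and their results are concatenated in order:

```haskell
import Text.XML.HXT.Core

main :: IO ()
main =
  -- arr (+ 1) yields 11, arr (* 2) yields 20; <+> keeps both results
  print $ runLA (arr (+ 1) <+> arr (* 2)) (10 :: Int) -- [11,20]
```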

Get all element names

doc //> hasAttr "id" >>> getElemName

We used the special function "//>" here! It's covered in this section.

Get all elements where the text contains "mouse"

import Data.List
runX $ doc //> hasText (isInfixOf "mouse")

Get the element's name and the value of id

ghci> runX $ doc //> hasAttr "id" >>> (getElemName &&& getAttrValue "id")
[("a","link1"),("a","link2"),("a","link3")]

Let's talk about the &&& function.

Arrow Interlude #2

&&& is a function for Arrows. The best way to see how it works is by example:

ghci> length >>> (odd &&& (+1)) $ ["one", "two", "twee"]
(True,4)

&&& takes two arrows and creates a new arrow. In the above example, the output of length is fed into both odd and (+1), and both return values are combined into a tuple (True, 4).

We used &&& to get an element's name and its id: (getElemName &&& getAttrValue "id").

Why is this function useful? Suppose we want to get all attributes on links:

runX $ doc >>> css "a" >>> getAttrl >>> getAttrName

Here's where it's nice to have &&&. The above line gives you something like this:

["href","class","id","href","class","id","href","class","id"]

The only problem: you have no idea what element each attribute belongs to! Use &&& to get a reference to the element as well:

ghci> runX $ doc >>> css "a" >>> (this &&& (getAttrl >>> getAttrName))
[(...some element..., "href"), (...another element..., "class")..etc..]
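
Why does each element show up once per attribute? Because HXT arrows can return several results, and &&& pairs every result from the left arrow with every result from the right one. Here's a minimal sketch of that on plain values, using runLA (which runs a pure list arrow on one input) and constL (which ignores its input and returns the given list as its results):

```haskell
import Text.XML.HXT.Core

main :: IO ()
main =
  -- `this` yields the input once, constL "ab" yields two results,
  -- so the single left result is paired with each right result
  print $ runLA (this &&& constL "ab") (1 :: Int) -- [(1,'a'),(1,'b')]
```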

HXT has lots of other arrows for selecting elements. See the docs for more.

Children and Descendents

HXT has a few different functions for working with children, and it can be tricky to decide which one to use.

So far we have been using the css function to get elements. Now let's see how we could implement a basic version of it:

css tag = multi (hasName tag)

css uses hasName to get elements with a given tag. Why don't we just use hasName instead of css?

ghci> runX $ doc >>> hasName "a"
[]

hasName only works on the current node, and ignores its descendents, whereas css allows us to look in the entire tree for elements. Here are some arrows for looking in the entire tree:

getChildren and multi

We could use getChildren to get the immediate child nodes:

ghci> runX $ doc >>> getChildren >>> getName
["html"]

But what if we want the names of all descendents, not just the immediate child node? Use multi:

ghci> runX $ doc >>> multi getName
["/","html","head","title","body","p","b","p","a","a","a","p"]

multi recursively applies an Arrow to an entire subtree. css uses multi to search across the entire tree for nodes.

deep and deepest

These two Arrows are related to multi.

deep recursively searches a whole tree for subtrees for which a predicate holds. The search is performed top down. When a tree is found, it becomes an element of the result list. The tree found is not examined any further for subtrees for which the predicate could also hold:

-- deep successfully got the name of the root element,
-- so it didn't go through the child nodes of that element.
ghci> runX $ doc >>> deep getName
["/"]

-- here, deep will get all p tags but it won't look for
-- nested p tags (multi *will* look for nested p tags)
ghci> runX $ doc >>> deep (hasName "p") >>> getName
["p","p","p"]

deepest is similar to deep but performs the search from the bottom up:

ghci> runX $ doc >>> deepest getName
["title","b","a","a","a","p"]
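
Our test.html has no nested p tags, so the difference between the three is easy to miss. Here's a sketch on a made-up snippet with one div inside another. The names nested and divIds are hypothetical helpers for this example. xread parses a String as XML content, and runLA runs the arrow without IO:

```haskell
import Text.XML.HXT.Core

-- a made-up snippet: one div nested inside another
nested :: String
nested = "<div id='outer'><div id='inner'>text</div></div>"

-- run the given search arrow (multi, deep or deepest) and collect the ids it finds
divIds :: (LA XmlTree XmlTree -> LA XmlTree XmlTree) -> [String]
divIds search = runLA (xread >>> search (hasName "div") >>> getAttrValue "id") nested

main :: IO ()
main = do
  print (divIds multi)   -- ["outer","inner"]  every match, nested ones too
  print (divIds deep)    -- ["outer"]          stops at the first match going down
  print (divIds deepest) -- ["inner"]          prefers matches further down the tree
```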

/> and //>

/> looks for a direct child (i.e. what getChildren does).

//> looks for a node somewhere under this one (i.e. what deep does).

So, these two lines are equivalent:

doc /> getText
doc >>> getChildren >>> getText

And these two lines are equivalent:

doc //> getText
doc >>> getChildren >>> (deep getText)

See docs for more.

Working With Text

Get the text in an element

ghci> runX $ doc >>> css "title" /> getText
["The Dormouse's story"]

Remember, this is the same as writing:

runX $ doc >>> multi (hasName "title") >>> getChildren >>> getText

Get the text in an element + all its descendents

doc >>> css "body" //> getText

Try using /> instead of //>. What do you get?

The wrong way:

ghci> runX $ doc >>> css "a" >>> (getAttrValue "href" &&& getText)
[]

This returns [] because doc >>> css "a" >>> getText returns [].

We need to go deeper! (i.e. use deep):

ghci> runX $ doc >>> css "a" >>> (getAttrValue "href" &&& (deep getText))
[("http://example.com/elsie","Elsie"),("http://example.com/lacie","Lacie"),("http://example.com/tillie","Tillie")]

Remove Whitespace

Use removeAllWhiteSpace. It removes all nodes containing only whitespace.

runX $ doc >>> css "body" >>> removeAllWhiteSpace //> getText

If you have used BeautifulSoup, this is kinda like the stripped_strings generator.

Modifying a Node

Modifying text

Use changeText. Here's how you uppercase all the text in p tags:

import Data.Char
uppercase = map toUpper
runX . xshow $ doc >>> css "p" /> changeText uppercase

Add or change an attribute

Use addAttr:

runX . xshow $ doc >>> css "p" >>> addAttr "id" "my-own-id"

Modifying Children

processChildren and processTopDown allow you to modify the children of an element.

Add an id to the children of the root node

-- adds an id to the <html> tag
runX . xshow $ doc >>> processChildren (addAttr "id" "foo")

Add an id to all descendents of the root node

-- adds an id to all tags
runX . xshow $ doc >>> processTopDown (addAttr "id" "foo")

processChildren is similar to getChildren, except that instead of returning the children, it modifies them in place and returns the entire tree.

processTopDown is similar to multi.

processTopDownUntil is similar to deep.
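
To check which elements each one touches, here's a small sketch on a made-up three-level snippet. The names sample and tagged are hypothetical. After the edit, it lists the names of all elements that ended up with an id attribute:

```haskell
import Text.XML.HXT.Core

-- a made-up snippet: a contains b, b contains c
sample :: String
sample = "<a><b><c/></b></a>"

-- apply an edit to the root element, then list every element that now has an id
tagged :: LA XmlTree XmlTree -> [String]
tagged edit = runLA (xread >>> edit >>> multi (hasAttr "id") >>> getName) sample

main :: IO ()
main = do
  print (tagged (processChildren (addAttr "id" "foo"))) -- ["b"]
  print (tagged (processTopDown (addAttr "id" "foo")))  -- ["a","b","c"]
```

processChildren only changed the direct child b, while processTopDown changed the root and everything under it.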

Conditionals (ifA)

HXT has some useful functions that allow us to apply Arrows based on a predicate.

Using ifA:

ifA is the if statement for Arrows. It's used as ifA (predicate Arrow) (do if true) (do if false).

Uppercase all the text for p tags only:

runX . xshow $ doc >>> processTopDown (ifA (hasName "p") (getChildren >>> changeText uppercase) (this))

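We use the identity arrow this here. You can read this as: if the element is a p tag, replace it with its children, uppercasing any text nodes among them; otherwise pass it through unchanged. Note that getChildren drops the p tag itself; to keep it, use processChildren (changeText uppercase) instead of getChildren >>> changeText uppercase.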

this has a complementary arrow called none. none is the zero arrow. Here's how we can use none to remove all p tags:

runX $ doc >>> processTopDown (ifA (hasName "p") (none) (this))

More Conditionals (when, guards, and filterA)

when and guards can make your ifA code easier to read.

Uppercasing text for p tags using when instead of ifA

runX . xshow $ doc >>> processTopDown ((getChildren >>> changeText uppercase) `when` hasName "p")

f `when` g -- when the predicate `g` holds, `f` is applied, else the identity filter `this`.

Deleting all p tags using guards

runX $ doc >>> processTopDown (neg (hasName "p") `guards` this)

g `guards` f -- when the predicate `g` holds, `f` is applied, else `none`.

Deleting all p tags using filterA

runX $ doc >>> processTopDown (filterA $ neg (hasName "p"))

filterA f -- a shortcut for f `guards` this

Using Functions as Predicates

How would we get all nodes that have "mouse" in the text? Here's one way:

runX $ doc //> hasText (isInfixOf "mouse")

But if the hasText function didn't exist, we could write it ourselves! Here's how:

First, import Text.XML.HXT.DOM.XmlNode. It defines several functions that work on Nodes.

import qualified Text.XML.HXT.DOM.XmlNode as XN

(Note the qualified import...this module has a lot of names that conflict with HXT.Core).

Here's a function that returns true if the given node's text contains "mouse":

import Data.Maybe
import Data.List

hasMouse n = "mouse" `isInfixOf` text
  where text = fromMaybe "" (XN.getText n)

isA lifts a predicate function to an HXT Arrow. Combined with isA, we can use hasMouse to filter out all nodes that don't have mouse as part of their text:

runX $ doc //> isA hasMouse

We can use isA wherever a predicate Arrow is needed: ifA, when, guards etc.
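
Since isA works on any value, not just nodes, we can try out the conditionals from the last two sections on plain numbers. This is only a sketch to show what each combinator returns. times10 and isEven are names made up for this example, and runLA runs the arrow without IO:

```haskell
import Text.XML.HXT.Core

times10 :: LA Int Int
times10 = arr (* 10)

-- isA turns the plain predicate `even` into a predicate arrow
isEven :: LA Int Int
isEven = isA even

main :: IO ()
main = do
  print $ runLA (ifA isEven times10 this) 4   -- [40]  predicate holds
  print $ runLA (ifA isEven times10 this) 3   -- [3]   falls through to `this`
  print $ runLA (times10 `when` isEven) 3     -- [3]   `when` defaults to `this`
  print $ runLA (isEven `guards` times10) 3   -- []    `guards` defaults to `none`
  print $ runLA (filterA isEven) 4            -- [4]   kept unchanged
  print $ runLA (filterA isEven) 3            -- []    filtered out
```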

See the docs for more conditionals for Arrows.

See these docs for more functions you can use to write your own Arrows.

Using Haskell Functions

Suppose we have a list of link texts:

ghci> runX $ doc >>> css "a" //> getText
["Elsie","Lacie","Tillie"]

And we want to get the length of each bit of text. So we need an arrow version of the length function.

We can lift the length function into an HXT arrow using arr:

ghci> runX $ doc >>> css "a" //> getText >>> arr length
[5,5,6]

Note how length automatically gets applied to each element without us having to use map. This is because an HXT Arrow can return any number of results, and the next Arrow in the chain is applied to each of those results in turn. This behaviour is abstracted away so that you can just write a function that works on one node and have it apply to every node that was selected.

Working With Lists

This section was written after Ywen asked this question on Reddit. So far, we have applied arrows to one node at a time. In the previous section, we applied length to every node individually. What if we wanted to work with all the nodes at once, to do a map or a foldl over them?

HXT has some special functions that allow you to work on the entire list of elements, instead of working on just one element.

>>. and >.

We already know how to get the text for all links:

ghci> runX $ doc >>> css "a" //> getText
["Elsie","Lacie","Tillie"]

How do we get the text with the results reversed? Use >>.:

ghci> runX $ (doc >>> css "a" //> getText) >>. reverse
["Tillie","Lacie","Elsie"]

>>. takes a function that takes a list, and returns a list, so it allows us to use all our Haskell list functions.

We could sort all the letters in the names:

ghci> import Data.List
ghci> runX $ (doc >>> css "a" //> getText) >>. (map sort)
["Eeils","Lacei","Teiill"]

How do we count the number of links in the doc? Use >.:

ghci> runX $ (doc >>> css "a" //> getText) >. length
[3]

>. takes a function that takes a list and returns a single value.

Getting the length of the text of all links combined:

ghci> runX $ (doc >>> css "a" //> getText >>. concat) >. length
[16]

The parentheses are important here!

-- Counts the number of links in the doc
ghci> runX $ (doc >>> css "a" //> getText) >. length
[3]

-- Oops! Runs `>. length` on each link individually
ghci> runX $ doc >>> css "a" //> getText >. length
[1,1,1]
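
Here's the same idea as a standalone sketch without any HTML, in case you want to play with it. names is a made-up arrow that uses constL to return three strings as its results. Compare >>> (works on each result) with >>. and >. (work on the whole list of results):

```haskell
import Data.List (sort)
import Text.XML.HXT.Core

-- a made-up arrow that ignores its input and yields three results
names :: LA () String
names = constL ["Tillie", "Elsie", "Lacie"]

main :: IO ()
main = do
  print $ runLA (names >>> arr length) () -- [6,5,5]  one length per result
  print $ runLA (names >>. sort) ()       -- ["Elsie","Lacie","Tillie"]
  print $ runLA (names >. length) ()      -- [3]      length of the result list
```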

Introducing HandsomeSoup

HandsomeSoup is an extension for HXT that provides a complete CSS2 selector implementation, so you can use complicated selectors like:

doc >>> css "h1#title"
doc >>> css "li > a.link:first-child"
doc >>> css "h2[lang|=en]"

...or any other valid CSS2 selector. Here are some other goodies it provides:

Getting Attributes With HandsomeSoup

Use ! instead of getAttrValue:

doc >>> css "a" ! "href"

Scraping Online Pages

Use fromUrl to download and parse pages:

let doc = fromUrl url
links <- runX $ doc >>> css "a" ! "href"

Downloading Content

Use openUrl:

content <- runMaybeT $ openUrl url
case content of
    Nothing -> putStrLn $ "Error: " ++ url
    Just content' -> writeFile "somefile" content'

Parse Strings

Use parseHtml:

contents <- readFile "test.html"
let doc = parseHtml contents

Avoiding IO

Look at the type of our html tree:

ghci> :t doc
doc :: IOSArrow XmlTree (NTree XNode)

It's in IO! This means that any function that parses the html will have to be IO. What if you want a pure function for parsing the html?

You can use hread:

-- old way:
ghci> let old = runX doc

-- using hread:
ghci> let new = runLA hread contents

And here are their types:

ghci> :t old
old :: IO [XmlTree] -- IO!

ghci> :t new
new :: [XmlTree] -- no IO!

ghci> runLA (hread >>> css "a" //> getText) contents
["Elsie","Lacie","Tillie"]

So why haven't we been using hread? Because IOSArrow is much more powerful; it gives you IO + State. hread is also much more stripped down. From the docs:

parse a string as HTML content, substitute all HTML entity refs and canonicalize tree. (substitute char refs, ...). Errors are ignored. This is a simpler version of readFromString without any options.
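
For example, here's a sketch of a pure function that pulls the link targets out of an HTML String. The name hrefs is made up for this example, and I use multi (hasName "a") instead of css so that it only needs HXT:

```haskell
import Text.XML.HXT.Core

-- a pure function: no IO in the type, so it can be called from anywhere
hrefs :: String -> [String]
hrefs = runLA (hread >>> multi (hasName "a") >>> getAttrValue "href")

main :: IO ()
main = print (hrefs "<p><a href='/one'>one</a> and <a href='/two'>two</a></p>")
```

Keep in mind that hread ignores errors, so a badly broken input just gives you fewer results instead of a warning.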

Debugging

HXT provides arrows to print out the current tree at any time. These arrows are very handy for debugging.

Use traceTree:

doc >>> css "h1" >>> withTraceLevel 5 traceTree >>> getAttrValue "id"

traceTree needs level >= 4.

Use traceMsg for sprinkling printf-like statements:

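doc >>> css "h1" >>> traceMsg 1 "got h1 elements" >>> getAttrValue "id"

traceMsg 1 needs level >= 1, so wrap the arrow in withTraceLevel (as with traceTree) to see the message.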

See the docs for even more trace functions.

Epilogue

I hope you found this guide helpful in your quest to work with HTML using Haskell.

"Haskell is awesome! Totes m' goats!" - Albert Einstein

Key Modules For Working With HXT

Arrows for working with nodes (the core stuff).

Arrows for working with children.

Conditional Arrows.

Function versions of most Arrows (Useful with arr or isA).

The HXT tutorial on haskell.org.

Practical HXT.

Understanding Arrows.
