hpaste

recent | annotate | new

> module Main where
> import Data.Char (ord, chr)

How to write a parser in a functional language.
----

This is a short introduction into how to write a parser in a functional
language.  The method isn't that different than in an imperative
language except there is more book keeping that needs to be done.
For example, the parser can't simply mark data as being consumed.  It
has to pass around the remaining data as it parses.  There's an upside,
too -- using higher order functions can simplify building a big parser
from smaller parsers.


Parsing functions
----

We want to build a function that parses a string and returns parsed data.
The string might not always be parsable, though.  Whatever parsing
function we end up writing will need a way to return failure.  We'll
use "Maybe" for this:

    > data Maybe a = Nothing | Just a

If the parse function can't parse the string it will return Nothing.  
If it can parse the string it will return (Just a) with some returned
parse data "a".  A fancier parser might return an error string to
signify what went wrong, but our parser wont.

The task now is to write a function that takes an input string and
returns parsed data or an error.  We'll want our parser to have
a type like:

    > parser :: String -> Maybe a

for some returned parse type a.  For example:

    > parseInt :: String -> Maybe Int

We're going to want to build up our parser from many smaller parsers
that only parse a part of the input.  So instead we're going to use
a type like:

    > parseInt :: String -> Maybe (String, Int)

The function will take a string and return Nothing if the string can't
be parsed, but return Just (resid, val) if the string can be parsed.
The result will have the unconsumed part of the input as well as the
return value.  For example, to parse "123 abc" with parseInt we'd want:

    > parseInt "123 abc" == Just (" abc", 123)

Finally we'll be dealing with a lot of functions of this type, so we'll
wrap it up neatly as a data type:

> data Parser a = Parser (String -> Maybe (String, a))

Each value of type "Parser a" will be a parsing function.  For
example "Parser Int" will be a value of type

    Parser (String -> Maybe (String, Int))

To put the function to use we'll define a few helpers.  runParser runs
the enclosed parsing function on some input:

> runParser :: Parser a -> String -> Maybe (String, a)
> runParser (Parser p) inp = p inp

parseString will also run a parser, but will return the result if
the entire string was consumed or an error otherwise:

> parseString :: Parser a -> String -> Maybe a
> parseString p inp = 
>     case runParser p inp of
>         Nothing            -> Nothing
>         Just (resid, val)  -> if null resid
>                                   then Just val
>                                   else Nothing

For example, if we have a function intParser of type Parser Int which
parses an integer then we'd expect that

    > runParser intParser "123 abc" == Just (" abc", 123)

and 

    > parseString intParser "abc" == Nothing
    > parseString intParser "123" == Just 123



Some small parsers
---

What exactly does a parsing function look like?  They're actually
pretty simple.  Lets build a few.  Lets start with a function
that parses the character 'a' and fails for any other input.  The
parser will return the parsed character:

> parseA :: Parser Char
> parseA = Parser onlyA
> onlyA []       = Nothing
> onlyA ('a':xs) = Just (xs, 'a')
> onlyA _        = Nothing

Remember that String is just [Char], a list of characters.  onlyA
will return the rest of the string and the character 'a' if the
first character is 'a', and will return Nothing otherwise.  parseA
is a Parser Char for the onlyA function.  If we run this parser on 
the input "a" we get back Just 'a':

    > parseString parseA "a" == Just 'a'

If we run this parser on any other input, we get back Nothing:

    > parseString parseA "test" == Nothing

We could write a similar parser for the letter 'b'.  Instead, let's
generalize by writing a function which makes an arbitrary character
parser:

> char :: Char -> Parser Char
> char ch = Parser (onlyCh ch)
> onlyCh ch (x:xs) | x == ch  = Just (xs, x)
> onlyCh ch _                 = Nothing

Now we can build a parser of 'b' or any other character we want very
easily:
    > parseString (char 'b') "b" == Just 'b'
    > parseString (char 'b') "test" == Nothing



Combining Parsers
---

We can continue writing more sophisticated parsers that work in
a similar manner, and we will return to this in just a bit.  But often
its a lot simpler to build a parser by combining two existing parsers.
For example, lets say we wanted to parse the character 'a' followed
by the character 'b' and return the two parsed characters.  We could
write this parser as:

> parseAB :: Parser (Char, Char)
> parseAB = Parser onlyAB
>

> parseAB :: Parser (Char, Char)
> parseAB = Parser onlyAB
>   where onlyAB inp = 
>             case onlyA inp of
>                 Nothing         -> Nothing
>                 Just (inp', x1) -> 
>                     case (onlyCh 'b') inp' of
>                         Nothing -> Nothing
>                         Just (inp', x2) -> Just (inp', (x1, x2))

This parser first uses onlyA to parse out an 'a'.  If this fails the
whole parse fails.  Otherwise it tries to parse out a 'b' from the
remaining input.  If this fails the whole parse fails.  Otherwise
the combined result is returned. 

It would be tedious to write every combined parser in this way.  Instead
we write a combining function that returns a new parser built out of
two smaller parsers.  This combining function works in exactly the same
way as parseAB did.  It first runs one parser and gets its results.
Then it runs the second parser and gets its results.  However, for
added flexibility instead of returning both results as a pair, it passes
the result from the first parser as a parameter to the second parser.
If thats unclear hold on for a second and I'll explain with some examples.

> andThen :: Parser a -> (a -> Parser b) -> Parser b
> p1 `andThen` f = Parser parseBoth
>     where parseBoth in1 = 
>             case runParser p1 in1 of
>                 Nothing        -> Nothing
>                 Just (in2, r2) -> runParser (f r2) in2

To illustrate lets make a parser that parses 'a' and then the same
character (whatever it might have been) again. 

> parseAA :: Parser Char
> parseAA = parseA `andThen` (\r -> char r)

The function involving "r" might look strange at first so let me draw
an analogy that will make it clear what is happening.  The expression:

    > (\r -> r + 5) 3

means bind three to "r" in the expression "r + 5".  It is very similar
to:

    > let r = 3 in r + 5

Similarly the expression

    > parseA `andThen` (\r -> char r)

is similar to saying "let r be the result from the parser parseA
in the expression charr r".  Its simply used as a way to pass around
the intermediate parsed expressions.  The meaning of parseAA can also
be seen by expanding its definition:

    > (parser onlyA) `andThen` (\r -> (Parser (onlyCh r)) = Parser parseBoth
    >     where parseBoth in1 = 
    >             -- case runParser (Parser onlyA) in1 of
    >             case onlyA in1 of
    >                 Nothing        -> Nothing
    >                 Just (in2, r2) -> 
    >                 Just (in2, r2) -> 
    >                     -- runParser (\r -> Parser (onlyCh r)) in2
    >                     runParser (Parser (onlyCh r2))

AndThen is useful for combining parsers, but it is not yet easy to
combine the results of parsers.  For that we'll need one more helper.
It builds a parser that doesnt parse anything at all, but returns a
value.  This parser sounds very boring but it is more useful than it
might first seem.

> setResult :: a -> Parser a
> setResult x = Parser constfunc 
>   where constfunc inp = Just (inp, x)

setResult can make a parser that returns any value we want to return.
Using andThen and setResult we can write a much simpler version of
parseAB:

> parseAB' :: Parser (Char, Char)
> parseAB' = 
>   parseA   `andThen` (\r1 ->
>   char 'b' `andThen` (\r2 ->
>   setResult (r1, r2)))

This probably looks a little scary at first.  Keep in mind the functions
involving r1 and r2 are simply used to bind intermediate results.  You
can think of this code as being similar to the imperative code:

    let r1 = parseA in
        let r2 = char 'b' in
            (r1, r2)

also do to operator precedence the whole thing can be written without
all the parenthesis:

> parseAB'' = 
>   parseA   `andThen` \r1 ->
>   char 'b' `andThen` \r2 ->
>   setResult (r1, r2)



It's a Monad
---

The pattern of combining parsers and building inert parsers just for
their results is a common one and it has a scary name: Monad.  Our
parsers form a monad.  The only reason I bring that up is that it lets
us use some syntactic sugar built into the Haskell language.  All we 
have to do to get access to the sugar is tell it about our monad:

> instance Monad Parser where
>     return = setResult
>     (>>=)  = andThen

Now we can write parsers using do-notation.  For example, the parseAB'
function can be rewritten as:

> parseAB''' :: Parser (Char, Char)
> parseAB''' = do
>   r1 <- parseA
>   r2 <- char 'b'
>   return (r1, r2)

This means exactly the same thing as the earlier definition using
andThen and setValue.  The do-notation is syntactic sugar that
automatically puts in the calls to andThen.

Another advantage of defining Parser as a monad is that we can use
any of a number of already defined monad functions.  But we won't
need to in this tutorial.



Another Combination
---

The "andThen" function defined earlier combine

Another Combination
---

The "andThen" function defined earlier combines two parsers into a new 
parser that runs the first and then runs the second.  This combination
parses longer strings that are concatenations of strings parsed by
the two parsers.  Another common way to combine parsers is to parse
a string with one parser or with another parser.  This corresponds to
alternation in a grammar.  

Lets start simple and write a parser by hand that parses the character
'a' or the character 'b'.  The parser will return whichever of the
two characters it parsed, or signify an error if the parse fails.

> parseAorB :: Parser Char
> parseAorB = Parser onlyAorB
> onlyAorB inp = 
>    case onlyA inp of
>        Nothing         -> onlyCh 'b' inp
>        Just (resid, v) -> Just (resid, v)

The parser tries to parse the 'a' first and if it succeeds returns its
value.  If it fails, it tries to parse the 'b' on the same input and
returns its result.  This pattern can be generalized to combine any
two parsers.  To simplify notation we'll use an infix function name
(and make it right-associative):

> infixr 1 <|>
>
> (<|>) :: Parser a -> Parser a -> Parser a
> p1 <|> p2 = Parser either
>     where either inp = case runParser p1 inp of
>             r@(Just _) -> r
>             Nothing    -> runParser p2 inp

Now we can rewrite parseAorB more simply as:

> parseAorB' :: Parser Char
> parseAorB' = parseA <|> char 'b'

You may have noticed something strange about the way <|> was defined:
It only tries to run the second parser if the first parser fails!
What happens if the first parser parses the string "ab" and the second
parser parses the string "about"?  The combined parser will parse the
first two characters as "ab" and then stop.  The parser will never
get a chance to match the full string.  The <|> function could be
rewritten to try both parsers and use the longest match, or possibly
try all matches.  This would make the parser much slower and for efficiency
sake, we will leave it up to the programmer to understand this deficiency.
The <|> function is not equivalent to alternation.  It is similar, but
is only equivalent if the strings parsed by the first parser are never
a prefix of strings parsed by the second parser.


Building Bigger
---

We now have the tools necessary for parsing complicated grammars.
For example, if we want a parser that parses the string "hello"
we could write:

> parseHello :: Parser String
> parseHello = do
>   c1 <- char 'h'
>   c2 <- char 'e'
>   c3 <- char 'l'
>   c4 <- char 'l'
>   c5 <- char 'o'
>   return [c1, c2, c3, c4, c5]

We could also write a general string parser function in terms of
char.  In this case, however, its easier to just write a new
parsing primitive:

> string :: String -> Parser String
> string xs = Parser (strParser xs)
> strParser :: String -> String -> Maybe (String, String)
> strParser xs inp = 
>     if pre == xs
>         then Just (resid, pre)
>         else Nothing
>     where (pre, resid) = splitAt (length xs) inp

We often want to parse a large set of alternations.  We could use
<|> to build a parser but this can get tedious.  For example:

> octalChar :: Parser Char
> octalChar = char '0' <|> char '1' <|> char '2' <|> char '3' <|> 
>   char '4' <|> char '5' <|> char '6' <|> char '7' 

Its easier to write a primitive to handle character classes.  We'll
do this in two steps -- with a parser that can parse any single
character that matches a predicate, and then with a parser that
parses any character from a list of characters:

> charPred :: (Char -> Bool) -> Parser Char
> charPred p = Parser (onlyPred p)
> onlyPred :: (Char -> Bool) -> String -> Maybe (String, Char)
> onlyPred p (x:xs) | p x  = Just (xs, x)
> onlyPred p _             = Nothing
>
> oneOf :: [Char] -> Parser Char
> oneOf xs = charPred (`elem` xs)

now octalChar can be rewritten more simply as:

> octalChar' :: Parser Char
> octalChar' = oneOf "01234567"


Repetition and Options
---

We may want to run a parser multiple times as long as it
is matching input.  We can turn a parser for any type a into a
parser for lists of as:

> many :: Parser a -> Parser [a]
> many p = Parser manyfunc where
>     manyfunc inp = case runParser p inp of
>         Nothing        -> Just (inp, [])
>         Just (inp2, x) -> 
>             case manyfunc inp2 of
>                 Nothing         -> Nothing -- impossible
>                 Just (inp3, xs) -> Just (inp3, x : xs)

And sometimes we want to parse one or more items in a list:

> many1 :: Parser a -> Parser [a]
> many1 p = do
>     x <- p
>     xs <- many p
>     setResult (x : xs)

And sometimes we want to parse something if it is present, but
ignore it if it is not.  In this case we will return either the
parsed value or some compatible value signifying that the item
wasn't present:

> option :: a -> Parser a -> Parse

> option :: a -> Parser a -> Parser a
> option failval p = p <|> setResult failval

A convenient way to signify an optional value is with Maybe.  If the
value is present Just val is returned, otherwise Nothing is returned:

> optionMaybe :: Parser a -> Parser (Maybe a)
> optionMaybe p = parseJust p <|> setResult Nothing
> parseJust :: Parser a -> Parser (Maybe a)
> parseJust p = do
>   x <- p
>   return (Just x)



Putting it together
---

Faith without works is dead, or so they say.  Lets put together a
small parser using our new toys.  Here's a small parser for request
URIs that are made of paths and queries in the HTTP protocol. 
The grammar being implemented is shown in the comments before each
parser definition.

> -- a parser for Request-URI
> -- Request-URI    = abs_path [ "?" query ] 
> request_uri :: Parser ([String], Maybe String)
> request_uri = do
>     path <- abs_path
>     q <- optionMaybe (do
>         char '?'
>         query)
>     return (filter (not.null) path, q)
> 
> 
> -- a parser for abs_path:
> -- abs_path      = "/"  path_segments
> -- path_segments = segment *( "/" segment )
> abs_path :: Parser [String]
> abs_path = many1 (do { char '/'; segment })
> 
> -- a parser for segment:
> -- segment       = *pchar
> segment :: Parser String
> segment = many pchar
> 
> -- a parser for query:
> -- query         = *uric
> query :: Parser String
> query = many uric
> 
> -- a parser for pchar, uric, unreserved,
> -- These all parse a single character from a set of characters.
> -- uric          = reserved | unreserved | escaped
> -- reserved      = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
> --                 "$" | ","
> -- unreserved    = alphanum | mark
> -- mark          = "-" | "_" | "." | "!" | "~" | "*" | "'" |
> --                 "(" | ")"
> -- alphanum      = alpha | digit
> -- alpha         = lowalpha | upalpha
> -- lowalpha      = ['a'..'z']
> -- upalpha       = ['A'..'Z']
> -- digit         = ['0'..'9']
> --
> -- combining these gives:
> -- pchar      = {unreserved} | {escaped} |    [:@&=+$,]
> -- uric       = {unreserved} | {escaped} | [;/?:@&=+$,]
> -- unreserved = {lower} | {upper} | {digit} | [-_.!~*'()]
> pchar, uric, unreserved :: Parser Char
> pchar = escaped <|> unreserved <|> oneOf ":@&=+$,"
> uric  = escaped <|> unreserved <|> oneOf ";/?:@&=+$,"
> unreserved = lower <|> upper <|> digit <|> oneOf "-_.!~*'()"
> lower = oneOf ['a'..'z']
> upper = oneOf ['A'..'Z']
> digit = oneOf ['0'..'9']
> 
> -- a parser for escaped, returning its decoded value:
> -- escaped       = "%" hex hex
> escaped :: Parser Char
> escaped = 
>     char '%' >>= \_ ->
>     hexDigitVal >>= \v1 ->
>     hexDigitVal >>= \v2 ->
>     return (chr (16*v1 + v2))
> 
> -- a parser for a hex, returning its value:
> -- hex           = digit | "A" | "B" | "C" | "D" | "E" | "F" |
> --                         "a" | "b" | "c" | "d" | "e" | "f"
> hexDigitVal :: Parser Int
> hexDigitVal = hd1 <|> hd2 <|> hd3
>     where hd1 = oneOf ['0'..'9'] >>= \ch ->
>                 return (ord ch - ord '0')
>           hd2 = oneOf ['a'..'f'] >>= \ch -> 
>                 return (10 + ord ch - ord 'a')
>           hd3 = oneOf ['A'..'F'] >>= \ch -> 
>                 return (10 + ord ch - ord 'A')
>
> -- Set subtraction -- remove characters from a string.
> without :: [Char] -> [Char] -> [Char]
> xs `without` [] = xs
> xs `without` (y:ys) = (filter (/= y) xs) `without` ys


And we have a small set of tests to exercise our parser and print
out the results:

> -- test a parser
> test :: Show a => Parser a -> String -> IO ()
> test p xs = do
>     print xs
>     case (parseString p xs) of
>         Nothing -> putStrLn "*** parse error"
>         Just x  -> print x
>     putStrLn ""
> testuri = test request_uri
> 
> -- run some tests
> main :: IO ()
> main = do
>     testuri "/etc/passwd"
>     testuri "/~newsham/foo.cgi?test=bar&this=that"
>     testuri "/%7enewsham/foo.cgi?test=bar&this=that"
>     testuri "/%7newsham/foo.cgi?test=bar&this=that"