> module Main where
> import Data.Char (ord, chr)
How to write a parser in a functional language.
This is a short introduction into how to write a parser in a functional
language. The method isn't that different than in an imperative
language except there is more book keeping that needs to be done.
For example, the parser can't simply mark data as being consumed. It
has to pass around the remaining data as it parses. There's an upside,
too
from smaller parsers.
Parsing functions
We want to build a function that parses a string and returns parsed data.
The string might not always be parsable, though. Whatever parsing
function we end up writing will need a way to return failure. We'll
use "Maybe" for this:
> data Maybe a = Nothing | Just a
If the parse function can't parse the string it will return Nothing.
If it can parse the string it will return (Just a) with some returned
parse data "a". A fancier parser might return an error string to
signify what went wrong, but our parser wont.
The task now is to write a function that takes an input string and
returns parsed data or an error. We'll want our parser to have
a type like:
> parser :: String -> Maybe a
for some returned parse type a. For example:
> parseInt :: String -> Maybe Int
We're going to want to build up our parser from many smaller parsers
that only parse a part of the input. So instead we're going to use
a type like:
> parseInt :: String -> Maybe (String, Int)
The function will take a string and return Nothing if the string can't
be parsed, but return Just (resid, val) if the string can be parsed.
The result will have the unconsumed part of the input as well as the
return value. For example, to parse "123 abc" with parseInt we'd want:
> parseInt "123 abc" == Just (" abc", 123)
Finally we'll be dealing with a lot of functions of this type, so we'll
wrap it up neatly as a data type:
> data Parser a = Parser (String -> Maybe (String, a))
Each value of type "Parser a" will be a parsing function. For
example "Parser Int" will be a value of type
Parser (String -> Maybe (String, Int))
To put the function to use we'll define a few helpers. runParser runs
the enclosed parsing function on some input:
> runParser :: Parser a -> String -> Maybe (String, a)
> runParser (Parser p) inp = p inp
parseString will also run a parser, but will return the result if
the entire string was consumed or an error otherwise:
> parseString :: Parser a -> String -> Maybe a
> parseString p inp =
> case runParser p inp of
> Nothing -> Nothing
> Just (resid, val) -> if null resid
> then Just val
> else Nothing
For example, if we have a function intParser of type Parser Int which
parses an integer then we'd expect that
> runParser intParser "123 abc" == Just (" abc", 123)
and
> parseString intParser "abc" == Nothing
> parseString intParser "123" == Just 123
Some small parsers
What exactly does a parsing function look like? They're actually
pretty simple. Lets build a few. Lets start with a function
that parses the character 'a' and fails for any other input. The
parser will return the parsed character:
> parseA :: Parser Char
> parseA = Parser onlyA
> onlyA [] = Nothing
> onlyA ('a':xs) = Just (xs, 'a')
> onlyA _ = Nothing
Remember that String is just [Char], a list of characters. onlyA
will return the rest of the string and the character 'a' if the
first character is 'a', and will return Nothing otherwise. parseA
is a Parser Char for the onlyA function. If we run this parser on
the input "a" we get back Just 'a':
> parseString parseA "a" == Just 'a'
If we run this parser on any other input, we get back Nothing:
> parseString parseA "test" == Nothing
We could write a similar parser for the letter 'b'. Instead, let's
generalize by writing a function which makes an arbitrary character
parser:
> char :: Char -> Parser Char
> char ch = Parser (onlyCh ch)
> onlyCh ch (x:xs) | x == ch = Just (xs, x)
> onlyCh ch _ = Nothing
Now we can build a parser of 'b' or any other character we want very
easily:
> parseString (char 'b') "b" == Just 'b'
> parseString (char 'b') "test" == Nothing
Combining Parsers
We can continue writing more sophisticated parsers that work in
a similar manner, and we will return to this in just a bit. But often
its a lot simpler to build a parser by combining two existing parsers.
For example, lets say we wanted to parse the character 'a' followed
by the character 'b' and return the two parsed characters. We could
write this parser as:
> parseAB :: Parser (Char, Char)
> parseAB = Parser onlyAB
>
> parseAB :: Parser (Char, Char)
> parseAB = Parser onlyAB
> where onlyAB inp =
> case onlyA inp of
> Nothing -> Nothing
> Just (inp', x1) ->
> case (onlyCh 'b') inp' of
> Nothing -> Nothing
> Just (inp', x2) -> Just (inp', (x1, x2))
This parser first uses onlyA to parse out an 'a'. If this fails the
whole parse fails. Otherwise it tries to parse out a 'b' from the
remaining input. If this fails the whole parse fails. Otherwise
the combined result is returned.
It would be tedious to write every combined parser in this way. Instead
we write a combining function that returns a new parser built out of
two smaller parsers. This combining function works in exactly the same
way as parseAB did. It first runs one parser and gets its results.
Then it runs the second parser and gets its results. However, for
added flexibility instead of returning both results as a pair, it passes
the result from the first parser as a parameter to the second parser.
If thats unclear hold on for a second and I'll explain with some examples.
> andThen :: Parser a -> (a -> Parser b) -> Parser b
> p1 `andThen` f = Parser parseBoth
> where parseBoth in1 =
> case runParser p1 in1 of
> Nothing -> Nothing
> Just (in2, r2) -> runParser (f r2) in2
To illustrate lets make a parser that parses 'a' and then the same
character (whatever it might have been) again.
> parseAA :: Parser Char
> parseAA = parseA `andThen` (\r -> char r)
The function involving "r" might look strange at first so let me draw
an analogy that will make it clear what is happening. The expression:
> (\r -> r + 5) 3
means bind three to "r" in the expression "r + 5". It is very similar
to:
> let r = 3 in r + 5
Similarly the expression
> parseA `andThen` (\r -> char r)
is similar to saying "let r be the result from the parser parseA
in the expression charr r". Its simply used as a way to pass around
the intermediate parsed expressions. The meaning of parseAA can also
be seen by expanding its definition:
> (parser onlyA) `andThen` (\r -> (Parser (onlyCh r)) = Parser parseBoth
> where parseBoth in1 =
>
> case onlyA in1 of
> Nothing -> Nothing
> Just (in2, r2) ->
> Just (in2, r2) ->
>
> runParser (Parser (onlyCh r2))
AndThen is useful for combining parsers, but it is not yet easy to
combine the results of parsers. For that we'll need one more helper.
It builds a parser that doesnt parse anything at all, but returns a
value. This parser sounds very boring but it is more useful than it
might first seem.
> setResult :: a -> Parser a
> setResult x = Parser constfunc
> where constfunc inp = Just (inp, x)
setResult can make a parser that returns any value we want to return.
Using andThen and setResult we can write a much simpler version of
parseAB:
> parseAB' :: Parser (Char, Char)
> parseAB' =
> parseA `andThen` (\r1 ->
> char 'b' `andThen` (\r2 ->
> setResult (r1, r2)))
This probably looks a little scary at first. Keep in mind the functions
involving r1 and r2 are simply used to bind intermediate results. You
can think of this code as being similar to the imperative code:
let r1 = parseA in
let r2 = char 'b' in
(r1, r2)
also do to operator precedence the whole thing can be written without
all the parenthesis:
> parseAB'' =
> parseA `andThen` \r1 ->
> char 'b' `andThen` \r2 ->
> setResult (r1, r2)
It's a Monad
The pattern of combining parsers and building inert parsers just for
their results is a common one and it has a scary name: Monad. Our
parsers form a monad. The only reason I bring that up is that it lets
us use some syntactic sugar built into the Haskell language. All we
have to do to get access to the sugar is tell it about our monad:
> instance Monad Parser where
> return = setResult
> (>>=) = andThen
Now we can write parsers using donotation. For example, the parseAB'
function can be rewritten as:
> parseAB''' :: Parser (Char, Char)
> parseAB''' = do
> r1 <- parseA
> r2 <- char 'b'
> return (r1, r2)
This means exactly the same thing as the earlier definition using
andThen and setValue. The donotation is syntactic sugar that
automatically puts in the calls to andThen.
Another advantage of defining Parser as a monad is that we can use
any of a number of already defined monad functions. But we won't
need to in this tutorial.
Another Combination
The "andThen" function defined earlier combine
Another Combination
The "andThen" function defined earlier combines two parsers into a new
parser that runs the first and then runs the second. This combination
parses longer strings that are concatenations of strings parsed by
the two parsers. Another common way to combine parsers is to parse
a string with one parser or with another parser. This corresponds to
alternation in a grammar.
Lets start simple and write a parser by hand that parses the character
'a' or the character 'b'. The parser will return whichever of the
two characters it parsed, or signify an error if the parse fails.
> parseAorB :: Parser Char
> parseAorB = Parser onlyAorB
> onlyAorB inp =
> case onlyA inp of
> Nothing -> onlyCh 'b' inp
> Just (resid, v) -> Just (resid, v)
The parser tries to parse the 'a' first and if it succeeds returns its
value. If it fails, it tries to parse the 'b' on the same input and
returns its result. This pattern can be generalized to combine any
two parsers. To simplify notation we'll use an infix function name
(and make it rightassociative):
> infixr 1 <|>
>
> (<|>) :: Parser a -> Parser a -> Parser a
> p1 <|> p2 = Parser either
> where either inp = case runParser p1 inp of
> r@(Just _) -> r
> Nothing -> runParser p2 inp
Now we can rewrite parseAorB more simply as:
> parseAorB' :: Parser Char
> parseAorB' = parseA <|> char 'b'
You may have noticed something strange about the way <|> was defined:
It only tries to run the second parser if the first parser fails!
What happens if the first parser parses the string "ab" and the second
parser parses the string "about"? The combined parser will parse the
first two characters as "ab" and then stop. The parser will never
get a chance to match the full string. The <|> function could be
rewritten to try both parsers and use the longest match, or possibly
try all matches. This would make the parser much slower and for efficiency
sake, we will leave it up to the programmer to understand this deficiency.
The <|> function is not equivalent to alternation. It is similar, but
is only equivalent if the strings parsed by the first parser are never
a prefix of strings parsed by the second parser.
Building Bigger
We now have the tools necessary for parsing complicated grammars.
For example, if we want a parser that parses the string "hello"
we could write:
> parseHello :: Parser String
> parseHello = do
> c1 <- char 'h'
> c2 <- char 'e'
> c3 <- char 'l'
> c4 <- char 'l'
> c5 <- char 'o'
> return [c1, c2, c3, c4, c5]
We could also write a general string parser function in terms of
char. In this case, however, its easier to just write a new
parsing primitive:
> string :: String -> Parser String
> string xs = Parser (strParser xs)
> strParser :: String -> String -> Maybe (String, String)
> strParser xs inp =
> if pre == xs
> then Just (resid, pre)
> else Nothing
> where (pre, resid) = splitAt (length xs) inp
We often want to parse a large set of alternations. We could use
<|> to build a parser but this can get tedious. For example:
> octalChar :: Parser Char
> octalChar = char '0' <|> char '1' <|> char '2' <|> char '3' <|>
> char '4' <|> char '5' <|> char '6' <|> char '7'
Its easier to write a primitive to handle character classes. We'll
do this in two steps
character that matches a predicate, and then with a parser that
parses any character from a list of characters:
> charPred :: (Char -> Bool) -> Parser Char
> charPred p = Parser (onlyPred p)
> onlyPred :: (Char -> Bool) -> String -> Maybe (String, Char)
> onlyPred p (x:xs) | p x = Just (xs, x)
> onlyPred p _ = Nothing
>
> oneOf :: [Char] -> Parser Char
> oneOf xs = charPred (`elem` xs)
now octalChar can be rewritten more simply as:
> octalChar' :: Parser Char
> octalChar' = oneOf "01234567"
Repetition and Options
We may want to run a parser multiple times as long as it
is matching input. We can turn a parser for any type a into a
parser for lists of as:
> many :: Parser a -> Parser [a]
> many p = Parser manyfunc where
> manyfunc inp = case runParser p inp of
> Nothing -> Just (inp, [])
> Just (inp2, x) ->
> case manyfunc inp2 of
> Nothing -> Nothing
> Just (inp3, xs) -> Just (inp3, x : xs)
And sometimes we want to parse one or more items in a list:
> many1 :: Parser a -> Parser [a]
> many1 p = do
> x <- p
> xs <- many p
> setResult (x : xs)
And sometimes we want to parse something if it is present, but
ignore it if it is not. In this case we will return either the
parsed value or some compatible value signifying that the item
wasn't present:
> option :: a -> Parser a -> Parse
> option :: a -> Parser a -> Parser a
> option failval p = p <|> setResult failval
A convenient way to signify an optional value is with Maybe. If the
value is present Just val is returned, otherwise Nothing is returned:
> optionMaybe :: Parser a -> Parser (Maybe a)
> optionMaybe p = parseJust p <|> setResult Nothing
> parseJust :: Parser a -> Parser (Maybe a)
> parseJust p = do
> x <- p
> return (Just x)
Putting it together
Faith without works is dead, or so they say. Lets put together a
small parser using our new toys. Here's a small parser for request
URIs that are made of paths and queries in the HTTP protocol.
The grammar being implemented is shown in the comments before each
parser definition.
>
>
> request_uri :: Parser ([String], Maybe String)
> request_uri = do
> path <- abs_path
> q <- optionMaybe (do
> char '?'
> query)
> return (filter (not.null) path, q)
>
>
>
>
>
> abs_path :: Parser [String]
> abs_path = many1 (do { char '/'; segment })
>
>
>
> segment :: Parser String
> segment = many pchar
>
>
>
> query :: Parser String
> query = many uric
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> pchar, uric, unreserved :: Parser Char
> pchar = escaped <|> unreserved <|> oneOf ":@&=+$,"
> uric = escaped <|> unreserved <|> oneOf ";/?:@&=+$,"
> unreserved = lower <|> upper <|> digit <|> oneOf "-_.!~*'()"
> lower = oneOf ['a'..'z']
> upper = oneOf ['A'..'Z']
> digit = oneOf ['0'..'9']
>
>
>
> escaped :: Parser Char
> escaped =
> char '%' >>= \_ ->
> hexDigitVal >>= \v1 ->
> hexDigitVal >>= \v2 ->
> return (chr (16*v1 + v2))
>
>
>
>
> hexDigitVal :: Parser Int
> hexDigitVal = hd1 <|> hd2 <|> hd3
> where hd1 = oneOf ['0'..'9'] >>= \ch ->
> return (ord ch ord '0')
> hd2 = oneOf ['a'..'f'] >>= \ch ->
> return (10 + ord ch ord 'a')
> hd3 = oneOf ['A'..'F'] >>= \ch ->
> return (10 + ord ch ord 'A')
>
>
> without :: [Char] -> [Char] -> [Char]
> xs `without` [] = xs
> xs `without` (y:ys) = (filter (/= y) xs) `without` ys
And we have a small set of tests to exercise our parser and print
out the results:
>
> test :: Show a => Parser a -> String -> IO ()
> test p xs = do
> print xs
> case (parseString p xs) of
> Nothing -> putStrLn "*** parse error"
> Just x -> print x
> putStrLn ""
> testuri = test request_uri
>
>
> main :: IO ()
> main = do
> testuri "/etc/passwd"
> testuri "/~newsham/foo.cgi?test=bar&this=that"
> testuri "/%7enewsham/foo.cgi?test=bar&this=that"
> testuri "/%7newsham/foo.cgi?test=bar&this=that"