Parser

The next stage in our project is parsing: taking a line of text from the user (or elsewhere), and creating the data objects it represents. Naturally the user might type something which does not represent an object according to our definitions, in which case we must have some way to signal an error.

Errors

Here is a definition of an Error type:

var (
    // Error_EndOfInput is returned at end of input.
    Error_EndOfInput = fmt.Errorf("eof")
    // Error_Syntax is returned for almost every error parsing.
    Error_Syntax = fmt.Errorf("syntax")
)

If, like me, you learned to program in BASIC on microcomputers, you will be familiar with the dreaded SYNTAX ERROR. Now is our chance to see things from the other side of the fence. Most of our functions from now on will return an Error to indicate whether and how something went wrong.

Lexer

I have no formal training in CS, but as far as I understand it the idea is to split a string up into tokens, which are both "words" and "punctuation", and discard any insignificant white space. So if the input is:

(foo bar)

Then the four tokens are:

( foo bar )

So let's start by creating a lexer, which will return the next token and the remainder of the input each time it is called. If it can't find a token (meaning that the input is empty or entirely whitespace), it will return nil for both token and remainder.

var (
    // whitespace separates tokens and is generally ignored
    whitespace = []byte{' ', '\t', '\r', '\n'}
    // prefix characters can appear before another token but aren't
    // necessarily delimiters
    prefix = []byte{'(', ')', '\''}
    // delimiters are characters that are not allowed in a symbol.
    // at the minimum, this must include all whitespace and
    // reserved characters.
    delimiters = []byte{' ', '\t', '\r', '\n', '(', ')'}
)

// lex extracts the next token from the input after skipping
// comments and leading whitespace. on end of input, it
// returns nil for both the token and the remainder.
func lex(input []byte) (token []byte, remainder []byte) {
    // skip whitespace
    input = skipws(input)

    // check for end of input
    if len(input) == 0 {
        return nil, nil
    }

    // check for prefix characters
    if bytes.IndexByte(prefix, input[0]) >= 0 {
        token, remainder = input[:1], input[1:]
        return token, remainder
    }

    // if we get here, the token is a symbol.
    // collect and return all the characters up to the first delimiter.
    token, remainder = runto(input, delimiters)
    return token, remainder
}

// runof splits the input in two. the first part is the prefix from input that
// includes delimiters. the second is the remainder of the input.
func runof(input, delim []byte) ([]byte, []byte) {
    for n, ch := range input {
        if bytes.IndexByte(delim, ch) == -1 {
            // ch is NOT a delimiter, so split here
            return input[:n], input[n:]
        }
    }
    // the entire input consists delimiters
    return input, nil
}

// runto splits the input in two. the first part is the prefix from input that
// does not include any delimiter. the second is the remainder of the input,
func runto(input, delim []byte) ([]byte, []byte) {
    for n, ch := range input {
        if bytes.IndexByte(delim, ch) != -1 {
            // ch IS a delimiter, so split here
            return input[:n], input[n:]
        }
    }
    // there are no delimiters in the input
    return input, nil
}

// skipws skips whitespace characters.
func skipws(input []byte) []byte {
    _, input = runof(input, whitespace)
    return input
}

We should write test cases for the lexer:

ID	Input	Tokens
1	""	""
2	"42"	"42", ""
3	"(foo bar)"	"(", "foo", "bar", ")", ""
4	"(s (t . u) v . (w . nil))"	"(", "s", "(", "t", ".", "u", ")", "v", ".", "(", "w", ".", "nil", ")", ")", ""
5	"a(b)c\n"	"a", "(", "b", ")", "c", ""

Parser

Now we can think about the parser itself. The entry point is read_expr, which will read a single (possibly complex) object and return the expression, the remainder of the input, and any errors found.

func read_expr(input []byte) (expr Atom, remainder []byte, err error)

We will first deal with the simple data: integers, symbols and NIL.

// read_atom reads an atom (a number or symbol) from the input.
// if it's a symbol, we assume that the caller has parsed it already
// and do no checking that it is a valid symbol.
func read_atom(input []byte, result *Atom) error {
    if val, err := strconv.Atoi(string(input)); err == nil { // it is an integer
        *result = make_int(val)
        return nil
    }
    // it is a symbol, but we must treat NIL specially.
    if label := bytes.ToUpper(input); bytes.Equal(label, []byte{'N', 'I', 'L'}) {
        // it is NIL and NIL must never be added to the symbol table.
        *result = _nil
    } else {
        *result = make_sym(label)
    }
    return nil
}

Notice two things: first, we are converting the input to upper case. This isn't strictly necessary — there's nothing wrong with having a case-sensitive LISP — but it is the traditional behaviour. Secondly, NIL is a special case: it's parsed directly as AtomType_Nil, rather than leaving it as a symbol.

If you're familiar with the various dialects of LISP then you will know that NIL is not necessarily the same as (), the empty list. We could choose to treat NIL as a symbol which evaluates to itself, but for this project we will consider both representations to be exactly the same.

Next up are lists (including improper lists and pairs). The simplified list syntax makes this a little complicated, so we'll stick it all in a helper function. Once again recursion allows us to deal with nested lists.

// read_list reads the next list from the input.
// it returns the remainder of the input or an error.
func read_list(input []byte, result *Atom) (remainder []byte, err error) {
    // set the result to NIL in case we read an empty list.
    *result = _nil

    var token []byte
    var tail Atom
    for token, remainder = lex(input); token != nil; token, remainder = lex(input) {
        // check for ")"
        if bytes.Equal(token, []byte{')'}) {
            // result holds the list.
            // return the remainder and no error
            return remainder, nil
        }

        // check for "."
        if bytes.Equal(token, []byte{'.'}) {
            // a dotted list must look like "(x . y)" or it is an improper list
            if nilp(tail) {
                // dot can't start a list, so this is an improper list
                return nil, Error_Syntax
            }

            // read the next expression and set the cdr of the current atom to it
            var expr Atom
            remainder, err = read_expr(remainder, &expr)
            if err != nil {
                // return the error
                return nil, err
            }
            setcdr(tail, expr)

            // read the closing paren
            token, remainder = lex(remainder)
            if !bytes.Equal(token, []byte{')'}) {
                // no closing paren, so this is an improper list
                return nil, Error_Syntax
            }

            // result holds the list.
            // return the remainder and no error
            return remainder, nil
        }

        // read the next expression
        var expr Atom
        remainder, err = read_expr(input, &expr)
        if err != nil {
            // return the error
            return nil, err
        }

        // and append it to the tail of the list
        if nilp(tail) {
            // first item in the list, so create a new list
            *result = cons(expr, _nil)
            tail = *result
        } else {
            // append to tail, then update tail
            setcdr(tail, cons(expr, _nil))
            tail = cdr(tail)
        }

        // at this point:
        //    result is the head of the list
        //    tail   is the last item in the list

        // update the input and loop back to parse the remainder of the input
        input = remainder
    }

    // eof is an error since lists must end with a close paren.
    return nil, Error_Syntax
}

// setcdr is a helper function to set the cdr of a pair.
// panics if p is not a pair.
func setcdr(p, a Atom) {
    p.value.pair.cdr = a
}

This takes advantage of the fact that the "zero value" for an Atom has _type set to AtomType_Nil. We also created a setcdr function to make the code a little easier to read.

Finally we have read_expr itself, which is very simple now that we have done all the hard work:

// read_expr reads the next expression from the input. an expression is
// either an atom or a list of expressions. returns the expression along
// with the remainder of the input and any errors.
// returns NIL and Error_EndOfInput on end of input. the caller must
// decide how to handle it.
// todo: result is not always updated by read. does that lead to bugs later?
func read_expr(input []byte, result *Atom) (remainder []byte, err error) {
    token, rest := lex(input)
    if token == nil { // end of input
        return nil, Error_EndOfInput
    }

    switch token[0] {
    case '(':
        return read_list(rest, result)
    case ')':
        // unexpected close paren
        return nil, Error_Syntax
    }
    err = read_atom(token, result)
    return rest, err
}

The check for a closing bracket will catch invalid forms such as

and

(X .)

Testing

We can use the parser to create table driven tests:

func TestChapter03(t *testing.T) {
    // test the read_expr function
    for _, tc := range []struct {
        id     int
        input  string
        expect string
    }{
        {id: 10, input: "42", expect: "42"},
        {id: 11, input: "(foo bar)", expect: "(FOO BAR)"},
        {id: 12, input: "(s (t . u) v . (w . nil))", expect: "(S (T . U) V W)"},
        {id: 13, input: "()", expect: "NIL"},
    } {
        // reset the symbol table
        sym_table = _nil

        input := []byte(tc.input)
        var expr Atom
        _, _ = read_expr(input, &expr)
        got := expr.String()
        if tc.expect != got {
            t.Errorf("%d: want %q: got %q\n", tc.id, tc.expect, got)
        }
    }
}

ID	Input	Output
10	"42"	"42"
11	"(foo bar)"	"(FOO BAR)"
12	"(s (t . u) v . (w . nil))"	"(S (T . U) V W)"
13	"()"	"NIL"

Looks good! Remember that () is exactly the same as NIL, and that (X Y) is just another way of writing (X . (Y . NIL)).