Section 2

note

You're not responsible for the materials here unless they are explicitly mentioned as required in lecture. Just remember that I have presented these things to you, so when you really need them later this semester, you can always come back here. Also, if you already know everything here and have no questions, you are not required to attend discussions.

Misc

Here are some pointers and brief introductions to a few terms.

Prefix Notation

It treats everything like a function call in C. We write 3 + 2 > 4 in infix notation for readability, but we write (> (+ 3 2) 4) in prefix notation for easier parsing. If we think of ">" and "+" as function names, this is simply something like func1(func2(3, 2), 4). Please refer to this wiki page.
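
As a rough illustration, here is a minimal Python sketch that tokenizes, parses, and evaluates such a prefix expression. The helper names (tokenize, parse, evaluate) and the tiny operator table are made up for this example.

```python
def tokenize(expr):
    """Split a prefix expression such as "(> (+ 3 2) 4)" into tokens."""
    return expr.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Recursively build a nested list from the token stream."""
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return node
    return int(token) if token.lstrip("-").isdigit() else token

# A tiny operator table, just enough for the example above.
OPS = {"+": lambda a, b: a + b, ">": lambda a, b: a > b}

def evaluate(node):
    """Evaluate a parsed prefix tree of the form [op, arg1, arg2, ...]."""
    if not isinstance(node, list):
        return node
    op, *args = node
    return OPS[op](*(evaluate(a) for a in args))

print(evaluate(parse(tokenize("(> (+ 3 2) 4)"))))  # the prefix form of 3 + 2 > 4
```

Note how parsing the prefix form needs no operator-precedence rules: the parentheses already encode the whole tree.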

Python

Again, please install Python 3, not Python 2. They are incompatible in many ways, so please use Python 3; otherwise your homework won't run during the grading process.

Here is some pretty good documentation about Python.

JSON

JSON stands for JavaScript Object Notation. It is a format designed for data interchange, similar to XML. You can find its official definition at http://www.json.org/. On the right side of that page, there is a floating box listing all the symbols in JSON and their meanings, written in a BNF-like way.

Here is a quick example of a JSON object.

{
    "key": "string value",
    "another key": ["array", "value"],
    "third key": 123,
    "last key": {
        "inner key": "embedded object",
        "null key": null
    }
}

The Python standard library for processing JSON is documented at http://docs.python.org/3/library/json.html.
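
For example, a quick round trip of the object above through the json module looks like this (the variable names are just for illustration):

```python
import json

# The example JSON object from above, as a Python string.
text = """
{
    "key": "string value",
    "another key": ["array", "value"],
    "third key": 123,
    "last key": {
        "inner key": "embedded object",
        "null key": null
    }
}
"""

obj = json.loads(text)              # JSON text -> Python dict
print(obj["third key"])             # JSON numbers become Python ints/floats
print(obj["last key"]["null key"])  # JSON null becomes Python None

dumped = json.dumps(obj, indent=4)  # Python dict -> JSON text
```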

Parsing and LL(1)

Generally speaking, there are two ways of parsing: top-down parsing and bottom-up parsing. LL(1) parsers use top-down parsing to parse languages generated by LL(1) grammar, which is a subset of context-free grammars/languages.

Top-Down Parsing

Recall the parse tree of a statement. Suppose we have a grammar

    program → a OP b
    OP → * | **

note

For a grammar this simple, you could write such a parser: recursively generate all the possible strings of the language, and check whether the input program is a member. But in practice it is not feasible to generate all possible strings. As a result, people invented parsing techniques that use the grammar rules to derive a string matching the input program, in a smarter way.

And we have three sentences

  1. a * b.
  2. a ** b.
  3. a *** b.

A top-down parser works in the following way (for parsing #1).

  1. It starts from the starting symbol of the grammar, program in our case.
  2. Choose an alternative to expand. Here we have only one choice, which is to expand program into a OP b.
  3. The a in the production rule matches the a in the input, so the parser moves on to the next token.
  4. OP is not a terminal, so we need to choose an alternative. Say a OP b => a * b.
  5. The * in the chosen alternative then matches the * in the input.
  6. Going on like this, it finally matches all of input #1 successfully and returns the following tree.

    {program: [
        a, 
        {OP: [
            *
        ]}, 
        b
    ]}

Since we start from the root of parse tree, and it grows downwards, we call it top-down.

If we try to parse #2,

  1. Same as before, until we choose a OP b => a * b. The * does match the input, but the second * in the input doesn't match the b in the rule. Therefore we backtrack to choose another alternative.
  2. This time, we try a OP b => a ** b, and we succeed.

    {program: [
        a, 
        {OP: [
            **
        ]}, 
        b
    ]}

If we try to parse #3,

  1. Same as before: program => a OP b.
  2. a OP b => a * b fails; backtrack.
  3. a OP b => a ** b fails; backtrack.
  4. No more alternatives are available. Report a syntax error to the user.

So, this is the general idea of top-down parsing.
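
The walkthrough above can be sketched as a tiny backtracking top-down parser in Python. The dict-based grammar representation and the function names are assumptions made for this illustration; the grammar is the toy one from the examples.

```python
# program -> a OP b ;  OP -> * | **
GRAMMAR = {
    "program": [["a", "OP", "b"]],
    "OP": [["*"], ["**"]],
}

def parse(symbol, tokens, pos):
    """Try to derive `symbol` from tokens[pos:]; return (tree, new_pos) or None."""
    if symbol not in GRAMMAR:  # terminal: must match the current input token
        if pos < len(tokens) and tokens[pos] == symbol:
            return symbol, pos + 1
        return None
    for alternative in GRAMMAR[symbol]:  # try each choice in order
        children, p = [], pos
        for sym in alternative:
            result = parse(sym, tokens, p)
            if result is None:
                break  # this choice failed: backtrack and try the next one
            tree, p = result
            children.append(tree)
        else:
            return {symbol: children}, p
    return None

def parse_program(tokens):
    result = parse("program", tokens, 0)
    if result and result[1] == len(tokens):
        return result[0]
    raise SyntaxError("grammar error")

print(parse_program(["a", "*", "b"]))   # succeeds immediately
print(parse_program(["a", "**", "b"]))  # succeeds after one backtrack
```

Running it on ["a", "***", "b"] exhausts both alternatives for OP and raises SyntaxError, just like step 4 for input #3.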

LL(1)

There are some problems with the top-down method we just introduced.

note

In today's section, we will use uppercase letters for non-terminals, and lowercase letters for terminals.

  1. Left recursion.

    If we have a production rule of the form P => P a, the parser will enter an infinite loop, expanding the leftmost P forever: P => P a => P a a => P a a a => .......

  2. Backtracking.

    It is very costly to backtrack from a very deep position, e.g. to go back from P => A B C D E F G H I J K a and then try P => A B C D E F G H I J K b. We waste all the work spent parsing A through K.

So people decided to construct a kind of grammar that avoids such problems. That is to say, we want grammars whose production rules contain no left recursion, and which are carefully designed so that no backtracking is needed during parsing. LL(1) grammars are one kind of grammar that meets these needs.

note

Recall that we have four types of grammars, one of which is context-free grammar. LL(1) grammar is a subset of context-free grammar.

So how does LL(1) solve those problems?

  1. For left recursion.

    People developed mathematical approaches to eliminate left recursion from production rules while keeping the resulting grammar equivalent to the original.

  2. For backtracking.

    People found that if the parser can't decide which alternative to use next, that is, if two alternatives conflict, it will probably have to backtrack or fail. There are generally two types of LL(1) conflicts.

    1. Example

      S → b | b a

      Both alternatives start with b. As a result, when the parser reads in a b, it can't decide which alternative to use, unless it tries and backtracks several times.

      note

      Please refer to the definition of LL(1) on our course webpage. Try to compare and understand it.

    2. Example

      S → E a b
      E → a | epsilon

      where epsilon is the empty string. Here, if the parser reads in an a, it doesn't know whether this a should match the a in S => epsilon a b or the first a in S => a a b. Therefore it can't decide whether to use E => a or E => epsilon.

    People also invented approaches to eliminate such conflicts, such as left-factoring.

If a grammar G has no left recursion and neither of these two kinds of conflicts, then G is an LL(1) grammar. Such grammars can be parsed with a top-down method without backtracking.

It is called LL(1) because it uses leftmost derivation, and the length of the look-ahead is one. The leftmost derivation is shown in the top-down examples above: at every step we expand the leftmost non-terminal. The look-ahead is not shown, however. It means how many input tokens we look ahead before deciding which alternative to use. Here it is one, which means the parser can decide which alternative to apply by looking at just the next input token. (The match operation happens after the production rule is applied.)

You can search "LL(1) example" for more examples, like this one: http://courses.cs.washington.edu/courses/cse401/04sp/slides/03b-LL1-example.pdf

A predictive parser is a recursive descent parser that does not require backtracking.
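
As a sketch of the difference, here is a predictive recursive descent parser for the same toy language (program → a OP b, OP → * | **) after left-factoring OP into OP → * OPTAIL, OPTAIL → * | epsilon. One token of lookahead picks every alternative, so it never backtracks. The function names are made up, and ** is assumed to arrive as two separate * tokens.

```python
def predictive_parse(tokens):
    """Parse program -> a OP b, with OP -> * OPTAIL and OPTAIL -> * | epsilon."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def match(expected):
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"expected {expected!r}, got {peek()!r}")
        pos += 1
        return expected

    def op():
        match("*")             # OP -> * OPTAIL
        if peek() == "*":      # one token of lookahead picks the alternative
            return {"OP": ["*", match("*")]}   # OPTAIL -> *
        return {"OP": ["*"]}                   # OPTAIL -> epsilon

    tree = {"program": [match("a"), op(), match("b")]}
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree

print(predictive_parse(["a", "*", "*", "b"]))  # ** written as two * tokens
```

Every decision is made by peeking at the next token, so no work is ever thrown away.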

note

I don't want to clutter things up here, so I won't give you the formal definition of an LL(1) grammar. If you want it, please read the dragon book (our textbook), or ask Google. Here is a pointer to the wiki page: http://en.wikipedia.org/wiki/LL_parser.

Left-recursion Elimination by Example

For productions like this:


P → Pα1 ∣ Pα2 ∣ ⋯ ∣ Pαm ∣ β1 ∣ ⋯ ∣ βn

where each αi ≠ ε, and no βj starts with P

It will be turned into


P → β1P′ ∣ β2P′ ∣ ⋯ ∣ βnP′

P′ → α1P′ ∣ α2P′ ∣ ⋯ ∣ αmP′ ∣ ε

And you can verify that the resulting language is the same.

warning

This actually eliminates only direct left recursion, turning it into right recursion. There are methods to eliminate all left recursion, direct or indirect, but they are more complicated and place some restrictions on the input grammar.
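
The rewriting rule above can be sketched as a small Python function. The dict-of-alternatives grammar representation is an assumption made for this illustration, with the empty list standing for ε.

```python
def eliminate_direct_left_recursion(grammar, p):
    """Turn P -> P a1 | ... | b1 | ... into P -> b1 P' | ..., P' -> a1 P' | ... | eps."""
    alphas = [alt[1:] for alt in grammar[p] if alt and alt[0] == p]
    betas = [alt for alt in grammar[p] if not alt or alt[0] != p]
    if not alphas:
        return grammar  # no direct left recursion to eliminate
    p_prime = p + "'"
    new = dict(grammar)
    new[p] = [beta + [p_prime] for beta in betas]
    new[p_prime] = [alpha + [p_prime] for alpha in alphas] + [[]]  # [] is epsilon
    return new

# Classic example: E -> E + T | T  becomes  E -> T E', E' -> + T E' | epsilon
g = {"E": [["E", "+", "T"], ["T"]]}
print(eliminate_direct_left_recursion(g, "E"))
# {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], []]}
```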

Left-factoring by Example

For productions like this:


A → δβ1 ∣ δβ2 ∣ ⋯ ∣ δβn ∣ γ1 ∣ ⋯ ∣ γm

We turn them into


A → δA′ ∣ γ1 ∣ ⋯ ∣ γm

A′ → β1 ∣ ⋯ ∣ βn
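
Similarly, one left-factoring step can be sketched in the same assumed dict-of-alternatives representation, factoring only on a single leading symbol for simplicity.

```python
def left_factor(grammar, a):
    """Turn A -> d b1 | d b2 | g... into A -> d A' | g..., A' -> b1 | b2 | ..."""
    alts = grammar[a]
    prefix = alts[0][0]  # factor on the first alternative's leading symbol
    with_prefix = [alt[1:] for alt in alts if alt and alt[0] == prefix]
    without = [alt for alt in alts if not alt or alt[0] != prefix]
    if len(with_prefix) < 2:
        return grammar  # no common prefix: nothing to factor
    a_prime = a + "'"
    new = dict(grammar)
    new[a] = [[prefix, a_prime]] + without
    new[a_prime] = with_prefix  # an empty list inside means epsilon
    return new

# The first conflict example: S -> b | b a  becomes  S -> b S', S' -> epsilon | a
g = {"S": [["b"], ["b", "a"]]}
print(left_factor(g, "S"))
# {'S': [['b', "S'"]], "S'": [[], ['a']]}
```

After this step, one token of lookahead is enough to choose between the alternatives of S′, which is exactly what LL(1) parsing needs.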

Summary

  1. Parsing: top-down vs. bottom-up
  2. Top-down: starting from the start symbol of the grammar, and building the parse tree downwards
  3. Leftmost derivation: always expanding the leftmost non-terminal first, or you can think of it as building the parse tree from left to right (while going downwards)
  4. General top-down pitfalls:

    1. left recursion causes an infinite loop when parsing
    2. backtracking causes a waste of computing resources
  5. LL(1): leftmost derivation, look-ahead of one token

    1. LL(1) grammar is a subset of context-free grammar
    2. LL(1) parser is a recursive descent parser
    3. LL(1) grammars have no left recursion and neither of the two kinds of conflicts listed above.
  6. Left-factoring: used to avoid backtracking
  7. Left-recursion elimination: used to avoid infinite loop