Challenging LR Parsing

(rust-analyzer.github.io)

121 points | by dilap 14 days ago

10 comments

  • cwzwarich 14 days ago

    > There’s a lot of literature about error recovery for LR parsers, how come academics haven’t figured this out already? I have a bold claim to make: error-recovery research in academia is focusing on a problem irrelevant for IDEs. Specifically, the research is focused on finding “minimal cost repair sequence”

    The parsing literature can be a bit difficult to navigate because the field is effectively dead for research purposes. However, other models of errors have been considered. Tim Wagner's PhD thesis is the best discussion of the application of LR parsing techniques in IDEs, and this paper extracted from it discusses error recovery in particular:

    http://harmonia.cs.berkeley.edu/papers/twagner-er.pdf

    In the non-interactive context, Richter's work on error recovery by substring parsing is also very interesting, in that it doesn't cause spurious errors when attempting to recover from errors, and it only depends on the language and not the grammar or parsing technique used:

    https://dl.acm.org/doi/10.1145/3916.4019

    For LR grammars (and many other grammars), the necessary substring parsing can be done in linear time using variations of the GLR algorithm.

    • raphlinus 14 days ago

      In the open source world, both Tree-sitter and Lezer are based on GLR, and cite Tim Wagner's thesis. So, while it might be possible to do better even staying in the GLR world, I think the comparison based on Tree-sitter is fair.

      I think these projects show that GLR is viable (there's a good set of grammars written for Tree-sitter), but I am also very willing to believe that the parser in rust-analyzer is more finely tuned for the specific needs of a Rust IDE.

      • maxbrunsfeld 14 days ago

        While Tree-sitter is heavily influenced by Wagner's thesis, our error recovery strategy actually uses a novel approach, so it is fair to question whether Wagner's would have different properties.

        For error recovery, Tree-sitter is currently optimized for some goals that are a bit different from the goals of a single-language IDE like Rust-Analyzer:

          * Like RA, we want to minimize the impact of syntax errors on basic features like syntax highlighting and symbol name extraction (which are important to GitHub, and in lightweight text editors like Neovim and Atom).
          * Unlike RA, I wanted to avoid requiring grammar authors to think about error recovery at all. 
          * We needed to tightly control the performance cost of pathological error recovery cases, like parsing a large Ruby file as Go (either accidentally or adversarially).
        
        My understanding is that Wagner's thesis is mostly concerned with "history-sensitive" error recovery in an IDE: using earlier syntax trees as an additional input to the error recovery process in order to better understand the user's intent when editing.

        I think this idea makes a lot of sense, but it doesn't address cases like the one Matklad has presented here, where the IDE is handed an already-invalid file.

        • ltratt 14 days ago

          > My understanding is that Wagner's thesis is mostly concerned with "history-sensitive" error recovery in an IDE: using earlier syntax trees as an additional input to the error recovery process in order to better understand the user's intent when editing.

          That matches my understanding (note that Wagner's error recovery algorithm is probably best supplemented with Lukas Diekmann's thesis, which corrects and expands it a bit https://diekmann.co.uk/diekmann_phd.pdf).

          Because Wagner's algorithm relies on history and context, it does sometimes do some unexpected (and not always useful) things. However, I don't see any fundamental reason why it couldn't be paired with a "traditional" error recovery approach (I have one to sell!) to try and get the best of both worlds.

          • maxbrunsfeld 14 days ago

            Yeah, I agree. To really demonstrate the best possible IDE error recovery that LR/GLR can provide, pairing a robust batch error recovery strategy with Wagner's (and your and Lukas's) history-sensitive approach is probably the way to go.

            At the time that I was implementing error recovery for Tree-sitter, that felt beyond my "complexity budget", since batch error recovery was already not trivial, and the batch approach was good enough for the features I was building on GitHub/Atom. Someday, I could imagine augmenting it with history-sensitive behavior. I am thankful that Lukas has helped to "pave that path" so thoroughly.

        • cwzwarich 14 days ago

          I was only commenting on his claims about the academic literature on parsing, not whether he should use the techniques verbatim.

          The problem with structured approaches to incremental parsing is that you rarely want incremental parsing alone. You’re going to want to do something else with the resulting parse tree (name resolution, type checking, compilation, etc.) and no structured approach to incremental parsing will help you there. Once you’re stuck writing an incremental type checker by hand, you probably have the right framework for also writing an incremental parser.

          Rust has additional complications, e.g. the macro system and the fact that the language is defined by a single implementation.

          • raphlinus 14 days ago

            I can confirm that Rust's macro system presents massive difficulties. For example, rustfmt is currently broken for a subdirectory in Druid because of the use of a `cfg_if!` macro. We filed a bug (https://github.com/rust-lang/rustfmt/issues/4266) but it got closed without (as far as I can tell) actually being fixed. Our present approach is to remove the use of the macro, but it's a bit of a bad sign: two features (macros and automatic formatting) work fine individually, but their interaction breaks.
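
            For context, the usage in question is along these lines (a reduced, hypothetical example, not the actual Druid code):

              // Hypothetical reduced example of cfg_if! usage of the kind that
              // trips up rustfmt; the module and item names are made up.
              use cfg_if::cfg_if;

              cfg_if! {
                  if #[cfg(target_os = "linux")] {
                      mod gtk;
                      pub use gtk::Window;
                  } else {
                      mod winapi;
                      pub use winapi::Window;
                  }
              }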

        • tlb 14 days ago

          For error recovery, do any parsers have a notion of which tokens are more likely to be wrong?

          For example, in an actively edited C++ file, braces are more likely to be correct than angle brackets, which are overloaded. So it's reasonable to auto-insert a closing > if you see a } with an open <, but not reasonable to insert a } if you see a > with an open <.
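
          Roughly the asymmetry I have in mind, as a made-up sketch (the names and structure are hypothetical, not taken from any real parser):

            #[derive(PartialEq)]
            enum Open { Brace, Angle }

            // Hypothetical repair rule: `<` is cheap to auto-close, `}` is not.
            fn repair(stack: &mut Vec<Open>, closer: char) -> Vec<char> {
                let mut inserted = Vec::new();
                match closer {
                    // Saw `}`: auto-close any `<` still open above the matching `{`.
                    '}' => {
                        while stack.last() == Some(&Open::Angle) {
                            stack.pop();
                            inserted.push('>');
                        }
                        if stack.last() == Some(&Open::Brace) {
                            stack.pop();
                        }
                    }
                    // Saw `>`: only consume a matching `<`; never fabricate a `}`.
                    '>' => {
                        if stack.last() == Some(&Open::Angle) {
                            stack.pop();
                        }
                    }
                    _ => {}
                }
                inserted
            }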

          • ltratt 14 days ago

            > For error recovery, do any parsers have a notion of which tokens are more likely to be wrong?

            lrpar has a %avoid_insert declaration which allows users to say "these token types are better avoided if there are other equivalently good repair sequences". It's simple, but it works well https://softdevteam.github.io/grmtools/master/book/errorreco...

            • maxbrunsfeld 14 days ago

              I like this solution, and have thought about adding something similar as an "advanced feature" in Tree-sitter grammars. I'm interested to see how you've used it in grammars.

            • raphlinus 14 days ago

              To the extent that there will be further research on this problem, I think it will be driven by machine learning. We're already seeing this with autocompletion in TabNine, and there's no reason a similar approach couldn't work for the other issues facing draft code.

              If you had a large dataset of keystroke-by-keystroke edit streams, you'd be able to match up erroneous code with the eventual fix. Google or Facebook could absolutely capture these datasets for their internal work if they wanted, and I suspect the results could be scary-good. They wouldn't be able to release the models, though, even for their open source work (Chromium and Android), because it would have such a high chance of revealing internal secrets.

        • rstuart4133 13 days ago

          Oh ... so that is what a Pratt parser is. Thanks for the nice clear writeup.

          But methinks it should be called a Wirth parser. Niklaus Wirth invented the same thing in the 1970s. He used it to parse Pascal, and drew the grammars as syntax diagrams: https://en.wikipedia.org/wiki/Syntax_diagram The loops that characterise Pratt are evident in those diagrams.

          You can read his book on Pascal here: https://www.research-collection.ethz.ch/bitstream/handle/20.... At the end you will find the syntax diagrams for the entire language. The compiler was derived from those diagrams in the straightforward way described in the article.
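
          For a flavour of that straightforward translation: a loop in a syntax diagram such as expression = term { ("+" | "-") term } becomes a plain while loop in the parser. A minimal sketch (illustrative only, not taken from the book):

            // Sketch: the diagram's loop over ("+" | "-") term maps directly onto
            // a `while` loop; for brevity a "term" here is just a single digit.
            fn expression(toks: &mut std::iter::Peekable<std::slice::Iter<'_, char>>) -> i64 {
                let mut value = term(toks);
                while let Some(&&op) = toks.peek() {
                    if op != '+' && op != '-' { break; }
                    toks.next();
                    let rhs = term(toks);
                    value = if op == '+' { value + rhs } else { value - rhs };
                }
                value
            }

            fn term(toks: &mut std::iter::Peekable<std::slice::Iter<'_, char>>) -> i64 {
                toks.next().and_then(|c| c.to_digit(10)).map(i64::from).unwrap_or(0)
            }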

          And everything the article says is true: if you are going to write a parser by hand, they are by far the easiest starting point. And I don't doubt error messages and error recovery are much easier too. Parser generators don't give you the same flexibility to play with a parser's guts.

          But you do lose some things. A parser generator checks for ambiguities, something you can easily miss in a hand-written parser. A parser generator also enforces a clean split between the parser and whatever you are doing with the syntax tree, whereas hand-written parsers lead you toward a ball of mud where one lump of code has all the tasks intermixed.

          And there are LR parsers out there that support the loops and priorities that raw BNF productions make so hard. Like mine :D http://lrparsing.sourceforge.net/doc/html/

          • sreekotay 14 days ago

            You know --- more and more --- I kinda like the REBOL approach --- forget precedence.

            Rather, just go L to R except where superseded by parens. Explicit and easy to understand.

            (Though I say that not having lived it in production, lol - so take with a grain of salt)

            Over time, readability/grokability tends to trump other factors...

            • jcranmer 14 days ago

              Because parsing "a += b * c" as "(a += b) * c" is a wonderful, intuitive program.

              (This does assume that op= is an expression in your language.)

              • sreekotay 14 days ago

                Edit: Realized what I was trying to say: Intuitive might be the wrong thing to chase.

                Intuitive to whom? :) Trivial programs are trivial. We might be stuck in a BCPL-derived syntax that trades a false notion of intuitiveness for complexity of implementation and maintenance. (And I do mean might --- it's hard not to think incrementally from the languages we know.) But "intuitive" is EXACTLY the problem: I guess I was arguing we might consider searching for simple and predictable instead (even if that comes at the cost of a slightly higher initial learning curve?)

                • sreekotay 13 days ago

                  e.g. maybe write it as "b * c += a" ? ... lol full on LR - no ambiguity.

                • phkahler 14 days ago

                  Including precedence rules will take care of that.

                  • masklinn 14 days ago

                    That… would be jcranmer's point; they're replying to a top-level comment advocating for not having precedence.

                • dTal 14 days ago

                  The APL approach, which is mostly just simple RL-associative + grouping parentheses, works surprisingly well in practice - although that probably owes something to the simplicity of the syntax and the way idiomatic APL uses mainly the built-in 1- and 2-argument functions.

                  • sreekotay 14 days ago

                    That was the thought: is there some non-BCPL-like grammar that makes sense? Like... (not a fully formed thought here) JSON is easy but verbose; YAML is a nightmare but cleaner. Maybe we're looking at it wrong....

                    • dTal 13 days ago

                      As a nearby comment observes, this leads to s-expressions.

                      The connection between APL syntax and s-expressions is surprisingly close. Start with prefix notation and right-associativity, add grouping parens, and you have something that looks like lisp. The Nial language is an outright fusion of the two that's fun and enlightening to play with.

                      https://en.wikipedia.org/wiki/Nial https://github.com/danlm/QNial7

                  • tlb 14 days ago

                    Smalltalk too. I really liked it.

                    In practice, C compilers warn you if you try to take advantage of && having higher precedence than ||, so other than for a few very simple rules, wise programmers use extra parens anyway.

                    • mamcx 14 days ago

                      I suspect it's not that bad to force people to group things manually (1 + (....)), but is precedence truly the biggest issue? I think even without it, hand-coded parsers are still superior for error reporting and other factors?

                      • jraph 14 days ago

                        I just wrote a parser in TypeScript that parses itself (for fun, and not quite done yet).

                        Precedence and associativity are a painful aspect of parsing JavaScript… and probably C, Java and many programming languages that look like C. It took me a non-negligible part of the time to figure out how to handle this. With the presence of infix, suffix and prefix operators, weird operators like new…() and ? … : …, and a lot of rules and exceptions, it's a lot of fun (for instance, yield is an operator in JavaScript. It takes an optional expression. However, 1 + yield is forbidden…). The grammar of JavaScript expressions is… not simple.

                        And then you have Automatic Semicolon Insertion, which forces you to backtrack and snapshot / restore the lexer state when consuming whitespace EVERYWHERE (well, not literally everywhere, but still), because things may behave differently if there is a newline. And of course, newlines cannot appear in some constructs (there are a few exceptions to the general behavior of ASI), so you need to take care of this too.

                        And then you have the comma operator, which is allowed in some places that take an expression (especially in older constructs) and disallowed in others (especially in newer constructs). So you need to handle this too.

                        Parsing the rest is easy. It's just that there are a lot of things in the language so it's a bit long to write.

                        • mamcx 14 days ago

                          Even with Pratt parsing? Or using another method?

                          • masklinn 14 days ago

                            I can only assume another method, as both are trivial using a Pratt parser (though right-associativity still feels a bit weird).

                            • jraph 14 days ago

                              Indeed, I wrote it by hand, the naive way.

                      • hinkley 14 days ago

                        The standing rule has always been 5th grade math comprehension.

                        As with most things, switching from too much to none is generally not an improvement.

                        Perhaps languages should have 8 levels of precedence instead of 18.

                        • ChrisSD 14 days ago

                          That kind of thinking inevitably leads to S-expressions.

                          • sreekotay 14 days ago

                            You might be right about that. Maybe we need more bracket flexibility or something. Or something built for TOOLING - nobody just writes freeform anymore?

                            • zokier 14 days ago

                              Or RPN

                              • sreekotay 14 days ago

                                Maybe - REBOL was interesting in that functions all have RL precedence and operators all have LR precedence. It's clean(-ish) without being too verbose/messy. It's not perfect - just thought it was an interesting direction to ponder...

                          • Nelkins 14 days ago

                            Somehow F# manages to do a pretty good job (although maybe I've just gotten used to any quirks re: error recovery and messages). Anybody know if they're doing anything differently?

                            https://github.com/fsprojects/FsLexYacc

                            https://github.com/dotnet/fsharp/blob/main/src/fsharp/lex.fs...

                            • matthewaveryusa 14 days ago

                              Well said, there certainly is a chasm between the theory and what practically works

                              As a sidenote, if like myself you like to dabble first before jumping into theory, I would recommend starting with Jison -- it's so much fun and you'll learn a lot quickly!:

                              https://zaa.ch/jison/try/

                              https://nolanlawson.github.io/jison-debugger/

                              • thesz 14 days ago

                                Chart parsing (CYK and friends) will produce exactly what is needed: as few contiguous, completely parsed parts of the program as possible. The syntax tree from the example above will be different, but not by much: the parts "fn f(" and "struct { ... }" will be adjacent rather than included in a partial tree.
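
                                To make the chart idea concrete, here is a minimal CYK recognizer over a toy grammar in Chomsky normal form (just an illustration of chart parsing in general, nothing to do with Rust's actual grammar); every cell that gets filled is a completely parsed contiguous fragment, whether or not the whole input parses:

                                  use std::collections::HashSet;

                                  // Toy CYK recognizer. chart[len][start] holds the nonterminals
                                  // deriving input[start ..= start + len]; the filled cells are
                                  // exactly the contiguous fragments that parse completely.
                                  fn cyk(
                                      input: &[&str],
                                      term_rules: &[(&str, &str)],      // A -> "token"
                                      bin_rules: &[(&str, &str, &str)], // A -> B C
                                  ) -> Vec<Vec<HashSet<String>>> {
                                      let n = input.len();
                                      let mut chart = vec![vec![HashSet::new(); n]; n];
                                      for (i, tok) in input.iter().enumerate() {
                                          for (lhs, t) in term_rules {
                                              if t == tok {
                                                  chart[0][i].insert(lhs.to_string());
                                              }
                                          }
                                      }
                                      for len in 1..n {
                                          for start in 0..n - len {
                                              for split in 0..len {
                                                  for (lhs, b, c) in bin_rules {
                                                      if chart[split][start].contains(*b)
                                                          && chart[len - split - 1][start + split + 1].contains(*c)
                                                      {
                                                          chart[len][start].insert(lhs.to_string());
                                                      }
                                                  }
                                              }
                                          }
                                      }
                                      chart
                                  }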

                                There's a paper [1] showing that an efficient parallel implementation is possible.

                                [1] https://jyp.github.io/pdf/PP.pdf

                                Enjoy. ;)

                                • leeoniya 14 days ago

                                  relevant:

                                  error recovery in CodeMirror 6 (inspired by Tree-sitter but adapted for IDEs):

                                  https://marijnhaverbeke.nl/blog/lezer.html

                                  a standalone incremental JS parser with error recovery:

                                  https://github.com/escaya/escaya

                                  • SloopJon 14 days ago

                                    This post is a next-level look at parsing interactively in an editor or IDE. Not only do you want it to be fast, but you want it to be resilient enough to provide useful output given incomplete input.

                                    Most of my poor attempts at parsing have been in the context of an offline tool. I find error handling hard enough using Yacc or Bison.

                                    I missed the post that this one responds to, so I'm going back and reading that discussion too:

                                    https://news.ycombinator.com/item?id=24480504

                                    • zamalek 13 days ago

                                      He suggests Pratt parsing as a production parser which, while unbelievably elegant, is recursive. Correctly recovering from a stack overflow (without leaking memory or resources) is, at best, a very hard problem worth researching in academia. Recursion generally isn't production code; its usefulness does not extend much further than confusing university students.

                                      That's the hidden benefit of parsers that use DFAs: they generally don't use recursion. They will parse something until you run out of memory, not stack space (which is at least 3 orders of magnitude smaller than the heap).

                                      • ghusbands 13 days ago

                                        It's trivial to replace the stack usage with a data structure. I've written a number of extended-Pratt parsers that don't use the stack.

                                        Also, a significant number of successful parsers have used recursion. As long as you successfully handle stack overflow and only hit it on pathological inputs, it works reasonably well.

                                        • zamalek 12 days ago

                                          > It's trivial to replace the stack usage with a data structure.

                                          Interesting. Do you have any examples of that? Reading the article actually had me thinking about how that would be achieved.

                                          • flashmozzg 7 days ago

                                            You could trivially (or rather, mechanically) emulate the hardware-bounded stack with a heap-based stack collection (like std::stack).
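
                                            For plain binary operators, that collapses to the classic shunting-yard / operator-precedence loop: explicit operand and operator stacks on the heap, no call-stack recursion. A rough Rust sketch (illustrative only, digits and single-character operators):

                                              fn binding_power(op: char) -> u8 {
                                                  match op { '+' | '-' => 1, '*' | '/' => 2, _ => 0 }
                                              }

                                              // Pop the two topmost operands and apply `op`.
                                              fn apply(operands: &mut Vec<i64>, op: char) {
                                                  let rhs = operands.pop().unwrap();
                                                  let lhs = operands.pop().unwrap();
                                                  operands.push(match op {
                                                      '+' => lhs + rhs,
                                                      '-' => lhs - rhs,
                                                      '*' => lhs * rhs,
                                                      _ => lhs / rhs,
                                                  });
                                              }

                                              // Evaluates e.g. "1+2*3-4"; nesting depth costs heap
                                              // memory (Vec), not call-stack frames.
                                              fn eval(input: &str) -> i64 {
                                                  let mut operands: Vec<i64> = Vec::new();
                                                  let mut operators: Vec<char> = Vec::new();
                                                  for c in input.chars() {
                                                      if let Some(d) = c.to_digit(10) {
                                                          operands.push(d as i64);
                                                      } else {
                                                          // Reduce while the stacked operator binds at least as tightly.
                                                          while operators
                                                              .last()
                                                              .map_or(false, |&top| binding_power(top) >= binding_power(c))
                                                          {
                                                              let top = operators.pop().unwrap();
                                                              apply(&mut operands, top);
                                                          }
                                                          operators.push(c);
                                                      }
                                                  }
                                                  while let Some(op) = operators.pop() {
                                                      apply(&mut operands, op);
                                                  }
                                                  operands.pop().unwrap()
                                              }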