• haberman 14 days ago

    I work on the protobuf team at Google, and I'm a big fan of Rust, though I haven't written much actual Rust except a bunch of Project Euler solutions.

    For protobuf in C++, we've been moving more and more in the direction of using arenas for memory allocation. When you parse a protobuf, it creates a tree of objects that are usually all deleted at the same time. Freeing an arena is much, much cheaper than traversing the tree of objects and calling free() on each one.

    My dream has been that Rust protobuf could support arenas as well as C++, but use Rust's type system to make it all provably correct at compile time (in C++ the lifetime management is inherently manual and unsafe). For absolute top performance, arenas will always beat trees of unique pointers (which I think corresponds to Rust's Box<> type).
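
As a rough illustration of the arena idea in safe Rust, here is a minimal sketch that stores every node of a parsed tree in one `Vec` and links nodes by index instead of by pointer. The names `Arena`, `Node`, and `NodeId` are made up for this example and are not taken from protobuf or any crate. It is index-based, so it sidesteps the lifetime question rather than answering it, and nothing stops an index from being used with the wrong arena.

```rust
/// Index of a node inside the arena (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
struct NodeId(usize);

/// A parsed "message": a value plus index links to sub-messages.
/// Children form a singly linked list so a node needs no allocation of its own.
struct Node {
    value: u32,
    first_child: Option<NodeId>,
    next_sibling: Option<NodeId>,
}

/// All nodes live in one contiguous Vec. Dropping the arena releases that
/// single buffer instead of freeing each node separately.
struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    fn new() -> Self {
        Arena { nodes: Vec::new() }
    }

    /// Append a node and return its index.
    fn alloc(&mut self, value: u32) -> NodeId {
        let id = NodeId(self.nodes.len());
        self.nodes.push(Node { value, first_child: None, next_sibling: None });
        id
    }

    /// Prepend `child` to the child list of `parent`.
    fn add_child(&mut self, parent: NodeId, child: NodeId) {
        let old_first = self.nodes[parent.0].first_child;
        self.nodes[child.0].next_sibling = old_first;
        self.nodes[parent.0].first_child = Some(child);
    }

    /// Sum the values of `root` and everything below it.
    fn sum(&self, root: NodeId) -> u32 {
        let node = &self.nodes[root.0];
        let mut total = node.value;
        let mut next = node.first_child;
        while let Some(id) = next {
            total += self.sum(id);
            next = self.nodes[id.0].next_sibling;
        }
        total
    }
}
```

A reference-based arena such as the bumpalo crate hands out `&T` values tied to the arena's lifetime instead of indices, which is closer to the C++ design described above.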

    I don't know Rust's type/lifetime system well enough to know if this is possible. I was looking recently at arenas in Rust and I noticed that Rust's version of placement new seems to be stalled:

    "Unfortunately the path forward for placement new in Rust does not look good right now, so I've reverted this crate to work more like a memory heap where stuff can be put, but not constructed in place."


    Does anyone know more about this?

    • nipunn1313 13 days ago

      Hey @haberman! One of the authors here. We actually do have an arena-esque implementation built on top of pb-jelly internally, as it was needed for Magic Pocket.

      It's built on top of the Blob traits exposed by pb-jelly. It's not yet open-source, but it would be a good candidate to do next! To your point, it also definitely has unsafe code. We open-sourced the safe implementations that use more standard types (Bytes/Buffer/Vec) first.

      There's a decent amount of cleanup needed before we can open-source that as well, as much of it was built years ago, when the Rust ecosystem was less mature (e.g. Bytes/Buffer weren't around yet).

      I like where you're thinking!

      • haberman 13 days ago

        That's great, I'll look forward to seeing the arena-oriented code someday. :)

      • pcwalton 14 days ago

        Placement new is just an optimization to avoid the initial memcpy from the stack in cases in which LLVM can't work it out itself. I don't believe that placement new ever enables semantics that aren't possible with plain old move semantics.

        • haberman 13 days ago

          Ah, in that case it sounds like heterogenous arenas are more or less a solved problem in Rust, even if they aren't necessarily 100% optimal.

          Probably the more difficult piece then is just how to model arena ownership of a tree of objects that all have links between them. We want to guarantee that links to sub-messages remain valid, which we would expect to be true if they are all in the same arena. But I believe Rust allows moving/swapping objects in and out of the arena?

          • eddyb 8 days ago

            > heterogenous arenas are more or less a solved problem in Rust

            To expand a bit on what pcwalton said: Rust never had a concept of a type you can't move (we likely can't ever introduce it) and placement was never about "in-place construction" (Rust doesn't even have a concept of "construction", but rather focuses on initialization).

            There is `Pin<&mut T>` which prevents e.g. replacing the value behind it with another `T` one, but there the restriction is in the pointer itself (you can think of it as `&pin mut T`), not in the pointee, and it's not that relevant here, it mainly exists to allow library code to correctly interface with the internally self-referential generators (that `async fn`s are built on).

            In the scenario you describe you would use shared references (with appropriate lifetime parameterization on relevant data types), not mutable ones (which imply exclusivity), nor raw pointers (which is the only way to get the C++-like problems). That will limit you to immutable data by default, so if you want to mutate some leaf fields, you wrap them in `Cell` (which prevents invalidation because `&Cell<T>` doesn't allow creating references inside the `T` value) or some kind of atomic/lock if you want concurrent access.

            This already solves all the problems you were thinking of, and it doesn't even fundamentally require an arena (what you need an arena for is the dynamic graph structure, assuming you want to use references/pointers and not integer indices).
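
A minimal sketch of the shared-reference approach described above, assuming a made-up `Message` type with one leaf field and one link to a sub-message. No arena is involved: the nodes are plain local values, and the borrow checker rejects any use of the parent after the child is gone.

```rust
use std::cell::Cell;

/// A message whose link to a sub-message is a shared reference.
/// The lifetime 'a ties the parent to the child it points at.
struct Message<'a> {
    /// Leaf field that can be changed through a shared `&Message`.
    id: Cell<u32>,
    /// Link to a sub-message; it cannot dangle while 'a is live.
    child: Option<&'a Message<'a>>,
}

/// Walk the chain of sub-messages and add up the ids.
fn sum_ids(msg: &Message<'_>) -> u32 {
    let mut total = 0;
    let mut cur = Some(msg);
    while let Some(m) = cur {
        total += m.id.get();
        cur = m.child;
    }
    total
}

/// Build a two-node chain, mutate the leaf while the parent still
/// points at it, and return the sum seen through the parent.
fn demo() -> u32 {
    let leaf = Message { id: Cell::new(1), child: None };
    let root = Message { id: Cell::new(10), child: Some(&leaf) };
    // Allowed: `Cell::set` only needs a shared reference.
    leaf.id.set(5);
    sum_ids(&root)
}
```

Swapping `&'a Message<'a>` for references handed out by an arena keeps the same shape; the arena only becomes necessary once the graph is built dynamically.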

            In essence, the more you rely on safe Rust features, the harder this is to get wrong or end up with a misusable API.

            Oh and the Rust compiler itself has used arenas since before 1.0 (IIRC), and with every year we move more data into them, though most are 100% immutable (which helps with compiler correctness, especially incremental recompilation), or even interned (deep deduplication by caching allocation attempts). We can even get away without dynamically tracking destructors by making most arena-allocated data destructor-less (we mostly just had to replace `Vec<T>` with arena-allocated `&[T]` to make that work).
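
To make "interned" concrete, here is a small sketch of the general idea: each distinct value is stored once and later requests for an equal value get the same handle back. This is not how the Rust compiler implements it (its interners are arena-backed and hand out references); `Interner`, `intern`, and `resolve` are hypothetical names for this example, and it uses integer symbols to stay within safe, std-only code.

```rust
use std::collections::HashMap;

/// Hypothetical string interner: each distinct string gets one small
/// integer symbol, and equal strings always map to the same symbol.
#[derive(Default)]
struct Interner {
    lookup: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Interner {
    /// Return the existing symbol if `s` was seen before; otherwise
    /// store it and return a fresh symbol.
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&sym) = self.lookup.get(s) {
            return sym;
        }
        let sym = self.strings.len() as u32;
        // For simplicity the text is kept twice (map key and table entry);
        // an arena-backed interner would store it once.
        self.strings.push(s.to_owned());
        self.lookup.insert(s.to_owned(), sym);
        sym
    }

    /// Map a symbol back to its text. Panics on a symbol from another interner.
    fn resolve(&self, sym: u32) -> &str {
        &self.strings[sym as usize]
    }
}
```

Comparing two interned values is then a comparison of two integers, regardless of how large the underlying data is.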

        • zozbot234 14 days ago

          Rust objects can always be moved to arbitrary places in memory, so I'm not sure how "in place" construction of a Rust object can even make sense unless it explicitly involves the Pin<P> feature to prevent moves. This is quite different from C/C++ where there's no default expectation that a constructed object can be bitwise-copied somewhere else in memory. Even Box<T> is really a special case since the object is meant to be accessed via an owning pointer or a (shared or mutable) reference; when accessing the underlying T by value, that always involves a move of the object so again there's no such thing as being "in place".

          • staticassertion 14 days ago

            Placement new is just avoiding a copy.

            Without placement new allocating an object involves:

            a) Constructing the object on the stack

            b) Copying it to the heap

            With placement new you construct the object directly on the heap, thus avoiding a copy.

            It's purely an optimization to avoid a single copy. It's a bit surprising that Rust doesn't have it, given the focus on control over such things, and that it had an experimental 'box' keyword for this at 1.0.

            edit: To be clear, the extra copy is not guaranteed - compilers can elide it. In order to help the compiler do this you can look at crates like boxext, which provide extension methods on Box that help the compiler to remove the copy. But placement new guarantees this.
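
The two steps above can be sketched as follows. The first function builds the value and then moves it into a `Box`, so whether the intermediate copy disappears is up to the optimizer. The second is one common std-only workaround, not placement new itself: `vec!` allocates the zeroed buffer on the heap directly, and `into_boxed_slice` converts it without copying the contents through the stack. The size `N` is an arbitrary choice for this sketch, kept small so the first function is safe to call in unoptimized builds.

```rust
/// Arbitrary size for the sketch (64 KiB).
const N: usize = 1 << 16;

/// Semantically: create the array as a value, then move it into the Box.
/// The compiler may elide the intermediate copy, but this is not guaranteed.
fn via_box_new() -> Box<[u8; N]> {
    Box::new([0u8; N])
}

/// Allocate the zeroed buffer on the heap directly, then turn the Vec
/// into a boxed slice. No N-byte temporary is built on the stack.
fn via_vec() -> Box<[u8]> {
    vec![0u8; N].into_boxed_slice()
}
```

With much larger values the first form can overflow the stack in debug builds, which is the practical reason people reach for workarounds like the second form or crates such as boxext.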

            • kevincox 13 days ago

              I haven't checked this case, but often the compiler will elide the copy. However, for cases like this you want the elision to be very reliable, ideally guaranteed, so it would be nice to get that support.

              • nemothekid 13 days ago

                IIRC, Rust did have this optimization (or LLVM was doing it), but then there was a version bump or change in IR that disabled it and it wasn't trivial to fix.

                • staticassertion 13 days ago

                  There should be an optimization pass for this, certainly, but the placement new syntax guarantees it.

            • zamalek 13 days ago

              Check out bumpalo[1]. Dodrio is a real-world[2] usage of bumpalo, so you can inspect that code to see how he deals with lifetimes correctly.

              Edit: GATs are how "placement new" (Rust doesn't have new at all, hence the airquotes) would work in Rust, assuming the author here meant to say "custom allocators." With GATs, you could create a pointer (Box, Arc, ArenaBox etc.) trait and use that on your message types.[3] "Placement new" is a whole different issue that ultimately boils down to a compiler optimization that is currently missing/not working and shouldn't functionally affect what you are trying to do at all.

              [1]: https://github.com/fitzgen/bumpalo [2]: https://github.com/fitzgen/dodrio [3]: https://rust-lang.github.io/rfcs/1598-generic_associated_typ...
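
A sketch of the pointer-trait idea using generic associated types, which were stabilized in Rust 1.65 (after this thread was written). The names `PointerFamily`, `BoxFamily`, `RcFamily`, and `Message` are illustrative only. An arena-backed pointer would additionally need access to the arena and a lifetime, which this sketch leaves out.

```rust
use std::ops::Deref;
use std::rc::Rc;

/// A "family" of smart pointers: given any T, it names a pointer type
/// for T and knows how to allocate one.
trait PointerFamily {
    type Pointer<T>: Deref<Target = T>;
    fn new<T>(value: T) -> Self::Pointer<T>;
}

/// Family that allocates with Box.
struct BoxFamily;
impl PointerFamily for BoxFamily {
    type Pointer<T> = Box<T>;
    fn new<T>(value: T) -> Self::Pointer<T> {
        Box::new(value)
    }
}

/// Family that allocates with Rc.
struct RcFamily;
impl PointerFamily for RcFamily {
    type Pointer<T> = Rc<T>;
    fn new<T>(value: T) -> Self::Pointer<T> {
        Rc::new(value)
    }
}

/// A message type that is generic over how its field is allocated.
struct Message<P: PointerFamily> {
    name: P::Pointer<String>,
}

impl<P: PointerFamily> Message<P> {
    fn new(name: &str) -> Self {
        Message { name: P::new(name.to_owned()) }
    }

    /// Reads through the pointer via the Deref bound on the family.
    fn name_len(&self) -> usize {
        self.name.len()
    }
}
```

Code written against `Message<P>` stays the same whichever family is chosen, for example `Message::<BoxFamily>::new("hello")` or `Message::<RcFamily>::new("hello")`.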

              • staticassertion 14 days ago

                AFAIK the current method for hacking in placement new is to use something like this: https://github.com/glandium/boxext

                Some more context. There used to be a `box` keyword too. https://github.com/rust-lang/rust/issues/50047

                • eximius 13 days ago
                  • zelly 13 days ago

                    R*st was designed to fragment the heap as much as possible. There's no way to fix it without major changes to the compiler or using "unsafe".

                  • q3k 14 days ago

                    From pb-jelly-gen:

                    > The core of this crate is a python2 script codegen.py that is provided to the protobuf compiler, protoc as a plugin.

                    That's... surprisingly janky. Not only is Python tooling always painful to deal with (compared to Go/Rust/...), but Python 2? And in a project that otherwise has no reason to depend on Python? :(

                    In comparison, the Go protoc plugin is written in Go, the alternative rust-protobuf protoc plugin is written in Rust, the Typescript one is written in Typescript...

                    • nipunn1313 14 days ago

                      Hi! One of the authors here. This was an oversight in the documentation. The codegen is py2 and py3 compatible. Fixed!

                      See issues https://github.com/dropbox/pb-jelly/issues/37 and https://github.com/dropbox/pb-jelly/issues/40 for context.

                      • cbhl 14 days ago

                        The very first prototypes of Dropbox were Py2, so I suspect that legacy is a reason why it was chosen for the codegen.

                        My $dayjob also started as a large python code base, and so there are still lots of one-off Py2 scripts that need to be deleted or rewritten.

                        • rvz 14 days ago

                          > In comparison, the Go protoc plugin is written in Go, the alternative rust-protobuf protoc plugin is written in Rust, the Typescript one is written in Typescript...

                          So they're maintaining a project with two languages with Python 2 as a hard requirement for code generation? Oh dear.

                          This tells me that this project will have the same fate as djinni [0] or their similar archived projects.

                          [0] https://github.com/dropbox/djinni

                          • rbtying 13 days ago

                            Another former contributor to pb-jelly here, though no longer at Dropbox.

                            protoc plugins have an interesting bootstrapping problem as well: the protoc-gen-$LANG interface requires the ability to ser/de protobuf messages that describe the proto file's AST. If your build system builds almost everything from scratch, including the protoc plugin, this means that you need to have a variant of your protoc plugin linked to a working proto implementation...

                            That's not to say this is impossible or even difficult, but at the time that I last looked at it (more than a year ago at this point), it made it fairly unpalatable to move the codegen from Python to Rust.

                            • eggsnbacon1 14 days ago

                              this might be one of those projects that gets open sourced with the hope that someone else will maintain it

                            • deepsun 14 days ago

                              There's already 6 different protobuf libraries for Rust: [1]

                              I've chosen Prost for our project, but see the whole list:


                              • q3k 14 days ago

                                It's 'only' really 4 (two of them are gRPC implementations).

                                But yeah - this is one of those things that makes me stay with Go instead of moving over to Rust for my backend SOA/microservice work. In Rust, for everything you need to do, there are at least 5 different libraries that implement that, all competing with each other. This is especially annoying when dealing with transitive dependencies. Meanwhile in Go, you generally get one choice - it might not be great, but that's fine, it doesn't have to be.

                                EDIT: This is not intended to be mindless bashing of Rust. I do use Rust for other things. It's a fine language.

                                • woah 14 days ago

                                  IMO the time spent choosing a library in Rust is about equal to the time spent debugging a null pointer in Go, but a Rust project will have a lot fewer libraries than a Go project will have null pointers.

                                  • saagarjha 13 days ago

                                    Perhaps you're not auditing your code as much as you ostensibly should.

                                  • staticassertion 14 days ago

                                    > In Rust, for everything you need to do, there are at least 5 different libraries that implement that, all competing with each other.

                                    That's interesting. This seems like a good thing to me. I've had no issues with microservices in Rust related to having too many crates.

                                    • jeffbee 14 days ago

                                      There are at least two Go protoc plugins, protobuf-go and gogoproto.

                                      • q3k 14 days ago

                                        It's the only real alternative implementation (and more precisely, a fork of upstream protobuf), _and_ there is strong cooperation [1] [2] between both projects to maintain a level of interoperability.

                                        And, it's not even protobuf I have a problem with - but things like HTTP implementations. There still isn't a canonical HTTP client/server implementation for Rust, while in Go basically everyone just uses `net/http`, or something that builds on top of that. Same for cryptographic primitives, TLS, context, ...

                                        [1] - https://docs.google.com/document/d/19kfhro7-CnBdFqFk7l4_Hmwa...

                                        [2] - https://github.com/gogo/protobuf/issues/386

                                        • jeffbee 14 days ago

                                          I imagine this stems from the fact that Go needed to be a complete production-ready backend language at launch, whereas Rust has other origins. At Google any "hello, world" program is a web server, at Mozilla not so much.

                                          • bfrog 12 days ago

                                            Not everyone uses net/http, and definitely not the built-in JSON marshaler, which is slower than Python's.

                                    • staticassertion 14 days ago

                                      This is great. AFAIK this is the only protobuf library in Rust that supports zero copy. Maybe this'll help some of the other libraries implement similar features?

                                      • lightgreen 14 days ago

                                        rust-protobuf supports zero-copy when using `bytes` feature and reading from `Bytes`.

                                      • rapsey 14 days ago

                                        Quick-protobuf has been around for a while and supports it.

                                        • staticassertion 14 days ago

                                          Thanks, I hadn't seen this. I'm curious about how they compare - it looks like quick-protobuf uses Cow, but I think pb-jelly doesn't?

                                          • rapsey 14 days ago

                                            Jelly uses an external crate called Bytes to achieve it. It has nicer usability compared to Cow I guess.
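
For readers unfamiliar with the `Cow` approach mentioned above, here is a minimal std-only sketch of what zero-copy decoding means: the decoded field borrows from the input buffer when it can and only allocates when it must. This is not the API of quick-protobuf or pb-jelly, and it does not parse the protobuf wire format. `Field` and `decode` are hypothetical, and the whole buffer is treated as a single string field.

```rust
use std::borrow::Cow;

/// A decoded field that borrows from the input buffer when possible.
struct Field<'a> {
    name: Cow<'a, str>,
}

/// Hypothetical decoder: treats the whole buffer as one string field.
/// Valid UTF-8 is borrowed in place (no copy); invalid bytes force an
/// owned copy with replacement characters.
fn decode(buf: &[u8]) -> Field<'_> {
    Field { name: String::from_utf8_lossy(buf) }
}
```

The lifetime on `Field<'a>` is the usability cost hinted at above: the decoded message cannot outlive the buffer it was read from. A reference-counted buffer type such as the one in the `bytes` crate avoids that lifetime parameter by sharing ownership of the buffer instead.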

                                      • zxv 14 days ago

                                        Cool. It would be interesting to compare benchmarks of RPC latency under heavy load for pb-jelly against other RPC methods.

                                        Rust has such great support for performant zero copy serialization and de-serialization in various formats (bincode, message pack, cbor, bson). Seeing this for protobuf feels very encouraging.

                                        • jl2718 14 days ago

                                          Is there really any good reason for code-gen? Just because “google did it” doesn’t mean it’s a good idea.

                                          • jakeva 14 days ago

                                            How about because for deterministic output, it suits the problem? Just because you can do it by hand doesn't mean I want to.

                                            • foolfoolz 14 days ago

                                              code generation is great for sharing data models and writing clients for services without dependencies

                                              • SOLAR_FIELDS 13 days ago

                                                Exactly this. Imagine you work at BigCorp and use protobuf to pass data around - now you have a unified data model you can share and everyone can use the same client to access it without going through the trouble of maintaining all those getters and setters. Rolling your own getters and setters is fine in a small project but you really see the advantages of the code gen approach once you are dealing with multiple different teams in an org working with the same complex data model.

                                                There are definitely some downsides to the approach though, mostly typical problems you would expect with machine generated code. Namely that it’s verbose and if you have a super complex protobuf data model (hundreds or thousands of fields) and want to ship a fatjar or similar bundling of dependencies you can run into some size issues.

                                              • q3k 14 days ago

                                                Type safety for your serialization/RPC layer.

                                                • jeffbee 14 days ago

                                                  Protobuf does not provide any type safety whatsoever. The name of the type of the message is carried in a side-channel, and the interpretation of that name is completely up to the endpoint that deserializes the message.

                                                  • q3k 14 days ago

                                                    Once you dispatch your binary/text protobuf into a proto message type however, you do get type safety, and it makes sense to carry over that type safety to implementing languages.

                                                    Plus, Any [1] is an effort to standardize the ability to carry the proto type (as a global identifier that can be used to retrieve its schema) alongside its serialized format.

                                                    [1] - https://developers.google.com/protocol-buffers/docs/proto3#a...

                                                • devit 13 days ago

                                                  It's required to have static type checking, which is essential. Also, way more time/memory efficient.

                                                • james412 14 days ago

                                                  I know it's common and perhaps even fashionable, but FWIW language like "We take an opinionated stance" utterly puts me off caring about this package

                                                  It's a piece of software, it has a design that is either fit for purpose or not. When ego becomes entangled in that design process, it's a strong indicator of the kind of experience one might have trying to get fixes or enhancements merged, or even the kind of attitude you'd find when attempting to report a bug.

                                                  • ajkjk 14 days ago

                                                    That's not what the word 'opinionated' means here. It's not any one person's opinion; it's that the project overall takes a stance on an issue rather than leaving everything open for everyone else to figure out. It provides clarity and direction compared to the more difficult situation where every library is completely general. No ego involved at all.

                                                    • james412 14 days ago

                                                      Perhaps I misunderstand the text in the README. Who is "we" in this case? Is the software writing its own README?

                                                      • ajkjk 13 days ago

                                                        'We' refers to the authors and their organization as a collective. That's still the meaning of the word 'opinionated' in this context.

                                                    • jspaetzel 14 days ago

                                                      More context is helpful here. "We take an opinionated stance that every module should be a crate, as opposed to generating Rust files 1:1 with proto files."