Boost.Http has a new parser

I usually don’t blog about updates on the Boost.Http project because I want as much info as possible to live in code and documentation (or even git history), not here. However, I have a reason to break this habit: a new parser I’ve been writing has replaced the NodeJS parser in Boost.Http, and this is the most appropriate place to announce the change. This info will also be useful if you’re interested in using the NodeJS parser, any HTTP parser, or even in designing a parser with a stable API unrelated to HTTP.

EDIT (2016/08/07): I tried to clarify the text. It should now be clear everywhere whether I’m referring to the new parser (the one I wrote) or the old parser (the NodeJS parser). I’ll also refer to Boost.Http with the new parser as “new Boost.Http” and Boost.Http with the old parser as “old Boost.Http”.

What’s wrong with the NodeJS parser?

I started developing a new parser because several users wanted a header-only library and the parser was the main barrier to that in my library. I took the opportunity to also provide a better interface: one that isn’t limited to the C language (inconvenient and full of unsafe type casts) and that uses a style which doesn’t own the control flow (easier to deal with HTTP upgrade and no need for lots of jumping back and forth among callbacks).

The NodeJS parser claims you can pause it at any time, but its API isn’t that reliable in practice. You need to resort to ugly hacks[1][2] if you want to properly support HTTP pipelining with an algorithm that doesn’t “go back”. If you decide not to stop the parser, you need to store all intermediate results while the NodeJS parser refuses to give control back to you, which forces you to allocate (even if the NodeJS parser doesn’t).

The NodeJS parser is hard to use. Not only does the callback model force me to go back and forth here and there, it also forces me to resort to ugly hacks full of unsafe casts (which also increase object size) in order to provide a generic templated interface[1][2][3][4]. Did I mention that it always consumes the current data, so I need to keep appending data everywhere (this is both good and bad; the new parser I implemented does a good job here)? The logic to handle field names is even more complex because of this appending[1][2][3]. And it gets worse: the NodeJS parser won’t always decode the tokens (matching and decoding are separate steps), so you need to decode them yourself, which requires knowing a lot of HTTP details.
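To make the appending dance concrete, here is a minimal sketch of what a user of the callback API ends up writing. It assumes the usual http_parser data-callback signature; registering the callback in http_parser_settings and all error handling are omitted:

```cpp
#include <cstddef>
#include <string>

#include "http_parser.h" // the NodeJS parser's header

// The value of a single header may arrive split across several
// on_header_value calls, so the user must accumulate the pieces himself:
// the parser matches tokens but doesn't fully decode them, and trailing
// optional whitespace (OWS) is left for the user to trim later.
struct parse_state
{
    std::string current_field;
    std::string current_value;
};

static int on_header_value(http_parser *parser, const char *at, std::size_t size)
{
    parse_state &state = *static_cast<parse_state*>(parser->data);
    state.current_value.append(at, size); // keep appending until the token ends
    return 0; // zero tells the parser to continue
}
```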

The old parser is so hard to use that I wouldn’t dare to apply, to the old Boost.Http, the same tricks I’ve used in the new Boost.Http to avoid allocations. So the NodeJS parser itself doesn’t allocate, but dealing with it is so hard that you give up on reusing the buffer to keep incomplete tokens at all, which is what forced an allocation (or a big-enough secondary buffer to hold them) in the old Boost.Http.

HTTP upgrade is also very tricky, and the lack of documentation for the NodeJS parser is depressing. I only trust my own code as a usage reference for the NodeJS parser.

However, I’ve hidden all this complexity from my users. My users wanted a different parser because they wanted a header-only library. Personally, I only wanted to change the parser because the NodeJS parser accepts just a limited set of HTTP methods and because it was tricky to properly avoid allocations. The new parser also makes it easier to reject an HTTP element before decoding it. For example, a too-long URL will exhaust the buffer, and then the new Boost.Http can just check the `expected_token` function to know it should reply with a 414 status code, instead of concatenating URL pieces until it detects that the limit was reached.
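Something like the following, where only `expected_token` comes from the actual design; the Parser type and the token codes passed in are illustrative stand-ins, not the real API:

```cpp
// Hypothetical sketch: when a fixed-size buffer fills up while the parser
// is still waiting for the request-target token, we can answer
// "414 URI Too Long" directly, without ever concatenating URL pieces.
template<class Parser, class TokenCode>
bool url_too_long(const Parser &parser, bool buffer_is_full,
                  TokenCode error_insufficient_data, TokenCode request_target)
{
    return buffer_is_full
        && parser.code() == error_insufficient_data
        && parser.expected_token() == request_target;
}
```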

If you aren’t familiar enough with HTTP details, you cannot assume the NodeJS parser will abstract HTTP framing for you. Your code will produce the wrong result and fail silently for a long time before you notice.

The new parser

EDIT (2016/08/09): The new parser is almost ready. It can be used to parse request messages (it’ll be able to parse response messages soon). It’s written in C++03 and it’s header-only. It only depends on boost::string_ref, boost::asio::const_buffer and a few other bits I may be forgetting right now. The new parser doesn’t allocate and returns control to the user as soon as one token is ready or an error is reached. You can mutate the buffer while the parser holds a reference to it. And the parser decodes the tokens for you (e.g. removing OWS from the end of header field values), so you don’t need the ugly hacks the NodeJS parser requires.
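To give an idea of what this token-pull style looks like, here is a rough sketch. The member names (set_buffer, next, code, value) are my illustration of the model described above, not necessarily the final API:

```cpp
// Illustrative sketch of the pull model (names are assumptions, not the
// published API): the parser never calls back into user code; you ask for
// the next token, inspect it, and regain control as soon as more bytes
// are needed.
template<class Parser, class Buffer, class Handler>
void pull_tokens(Parser &parser, const Buffer &buffer, Handler &handler)
{
    parser.set_buffer(buffer); // the parser only *references* the buffer
    for (;;) {
        parser.next(); // advance to the next token -- no callbacks involved
        if (parser.code() == Parser::error_insufficient_data)
            return; // control is back: read more bytes, then call again
        if (parser.code() == Parser::token_method)
            handler.on_method(parser.value()); // tokens come already decoded
        else if (parser.code() == Parser::token_end_of_message)
            handler.on_end_of_message();
        // ...remaining token types omitted for brevity
    }
}
```

Compare this with the callback model: there is no jumping back and forth, and pausing at a token boundary is the natural state of the loop, not a special case.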

I want to tell you that the new parser was NOT designed around Boost.Http’s needs. I wanted to make a general parser, and the design started from there. Then I wanted to replace the NodeJS parser within Boost.Http, and the parts fit nicely. The only piece that didn’t fit perfectly when integrating was a missing end_of_body token, which was easy to add to the new parser’s code. This was the only time that I, as the author of Boost.Http and a user of the new parser, used my power as the author of the parser itself to push my needs onto everybody else. And this token was a nice addition anyway (with the NodeJS API you’d use http_body_is_final).

My mentor Bjorn Reese had the vision to use an incremental parser much earlier than I did. I was only convinced of the power of incremental parsers when I saw a CommonMark parser implemented in Rust. It convinced me immediately; it was very effective. I then borrowed several conventions on how to represent tokens in C++ from one of Bjorn’s experiments.

There is also the “parser combinators” part of this project (still not ready), which I only understood once I watched a talk by Scott Wlaschin. Initially I was having a lot of trouble because I wanted stateful miniparsers to avoid “reparsing” certain parts, but you rarely read chunks of size 1, so I was only complicating things. The combinators part is tricky to deliver because the next expected token depends on the value (semantics, not syntax) of the current token, and this is hard to represent with expressions like Boost.Spirit’s abstractions. Therefore, I’m only going to deliver the mini-parsers, not the combinators. Feel free to give me feedback/ideas if you want to.
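To make the difficulty concrete: the classic HTTP instance of this value-dependency is message framing itself. The sketch below is generic HTTP knowledge (simplified from RFC7230, section 3.3.3), not tied to the parser’s API:

```cpp
#include <cstddef>

// After the header section, the next expected token depends on the *value*
// of earlier tokens, not on syntax alone -- exactly what Spirit-style
// combinator expressions struggle to capture.
enum body_framing { no_body, chunked_body, fixed_size_body };

body_framing choose_framing(bool has_chunked_te, bool has_content_length,
                            std::size_t content_length)
{
    if (has_chunked_te)
        return chunked_body;    // next tokens: chunk-size, chunk-data, ...
    if (has_content_length && content_length > 0)
        return fixed_size_body; // next tokens: exactly content_length bytes
    return no_body;             // next token: end_of_message
}
```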

Needless to say, the new parser keeps the great features of the NodeJS parser, like no allocations or syscalls behind the scenes. But it was actually easier to avoid and reduce allocations in Boost.Http thanks to the parser’s design: it doesn’t force the user to accumulate values in separate buffers, and it makes offsets easy to obtain.

I probably could have achieved the same reduction in buffers in Boost.Http with the NodeJS parser, but it is quite hard to work with (read the section above). And you should know that the parser-related code in the old Boost.Http was almost 3 times bigger than the parser-related code in the new Boost.Http (it’d be almost 4 times bigger, but I had to add code to detect the keep-alive property, because the new parser only cares about message framing).

On the topic of performance, the new Boost.Http test suite takes 7% more time to finish (using a CMake Release build with GCC on my machine). I haven’t spent time trying to improve performance, and I think I’ll only try to improve memory usage anyway (the size of the parser structure).

A drawback (is it?) is that the new parser only cares about structuring the HTTP stream. It doesn’t care about connection state (exception: receiving an HTTP 1.0 response body / the connection-close event). Therefore, you need to implement keep-alive yourself (which the higher-level Boost.Http layers do).
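For reference, this is roughly the persistence rule those higher-level layers implement (my paraphrase of RFC7230, section 6.3, not Boost.Http’s actual code):

```cpp
// HTTP/1.1 connections are persistent by default; HTTP/1.0 connections
// need an explicit opt-in; "Connection: close" always wins.
bool keep_alive(int http_major, int http_minor,
                bool has_connection_close, bool has_connection_keep_alive)
{
    if (has_connection_close)
        return false;                 // explicit "Connection: close"
    if (http_major == 1 && http_minor >= 1)
        return true;                  // HTTP/1.1 default
    return has_connection_keep_alive; // HTTP/1.0 explicit keep-alive
}
```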

I want to emphasize that the authors of the NodeJS parser have done a wonderful job with what they had in hand: C!

Migrating code to use the new parser

First, I haven’t added the code to parse the status line yet, so the parser is limited to HTTP requests. It shouldn’t take long (a few weeks until I finish this and several other tasks).

When you’re ready to upgrade, use the history of the Boost.Http project (the files include/boost/http/socket-inl.hpp and include/boost/http/socket.hpp) as a guide. If you’ve been using the NodeJS parser improperly, it’s very likely your code didn’t have as many lines as Boost.Http had. And your code probably isn’t as heavily templated as Boost.Http anyway, so it’s very likely you didn’t need as many tricks with http_parser_settings as Boost.Http did.

The Tufão project had been using the NodeJS parser improperly for ages and it’d be hard to fix that. Therefore, Tufão 1.4.0 has been refactored to use this new parser. It finally received support for HTTP pipelining, and plenty of bugfixes that nobody had noticed landed. Unfortunately, I got the semantics for HTTP upgrade within Tufão wrong and it kind of has a “forced HTTP upgrade” (this is something I got right in Boost.Http thanks to RFC7230’s clarifications).

Next steps

I may have convinced you to prefer the Boost.Http parser over the NodeJS parser when it comes to C++ projects. However, I hope to land a few improvements before calling it ready.

API/design-wise, I hope to finish the miniparsers for individual HTTP ABNF rules.
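To illustrate what one of these miniparsers amounts to, here is a sketch for RFC7230’s token rule (the interface is illustrative, not the final one):

```cpp
#include <cstddef>

// tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
//         "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA  (RFC7230)
inline bool is_tchar(unsigned char c)
{
    switch (c) {
    case '!': case '#': case '$': case '%': case '&': case '\'':
    case '*': case '+': case '-': case '.': case '^': case '_':
    case '`': case '|': case '~':
        return true;
    default:
        return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z');
    }
}

// Returns how many leading bytes of [data, data + size) match the rule.
inline std::size_t match_token(const char *data, std::size_t size)
{
    std::size_t matched = 0;
    while (matched != size && is_tchar(static_cast<unsigned char>(data[matched])))
        ++matched;
    return matched;
}
```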

Test-wise, I can already tell you that more than 80% of all the code written for this parser is tests (about 4 lines of test for each line of implementation). However, I haven’t run the tests in combination with sanitizers (yet!) and there are a few more areas where tests can be improved (coverage, allocating buffer chunks separately so sanitizers can detect invalid access attempts, fuzzers…), and I’ll work on them as well.

Memory-wise, I can add some code to deduce the optimal size for indexes and return a parser with a little less overhead.
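The idea is roughly the following (a hypothetical sketch, written in C++11 for brevity even though the parser itself is C++03):

```cpp
#include <cstdint>
#include <type_traits>

// Pick the narrowest unsigned type able to index a buffer of at most
// MaxSize bytes, shrinking the offsets the parser structure must store.
template<std::uint64_t MaxSize>
struct index_for
{
    typedef typename std::conditional<
        MaxSize <= 0xFF, std::uint8_t,
        typename std::conditional<
            MaxSize <= 0xFFFF, std::uint16_t,
            typename std::conditional<
                MaxSize <= 0xFFFFFFFF, std::uint32_t, std::uint64_t
            >::type
        >::type
    >::type type;
};

// e.g. index_for<4096>::type is std::uint16_t
```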

EDIT (2016/08/11)

I’ve added a link that should have been here since the first version of this post: https://gist.github.com/vinipsmaker/4998ccfacb971a0dc1bd

It adds a lot of background on the project. It’s the proposal I sent to get some funding to work on the project.

