I usually don’t blog about updates on the Boost.Http project because I want as much info as possible in code and documentation (or even git history), not here. However, I got a stimulus to change this habit. A new parser I’ve been writing replaced the NodeJS parser in Boost.Http and here is the most appropriate place to inform about the change. This will be useful info also if you’re interested in using NodeJS parser, any HTTP parser or even designing a parser with a stable API unrelated to HTTP.
EDIT (2016/08/07): I tried to clarify the text. I now try to make it clear everywhere whether I’m referring to the new parser (the one I wrote) or the old parser (NodeJS parser). I’ll also refer to Boost.Http with the new parser as new Boost.Http and Boost.Http with the old parser as old Boost.Http.
What’s wrong with NodeJS parser?
I started developing a new parser because several users wanted a header-only library and the parser was the main barrier for that in my library. I took the opportunity to also provide a better interface which isn’t limited to C language (inconvenient and lots of unsafe type casts) and uses a style that doesn’t own the control flow (easier to deal with HTTP upgrade and doesn’t require lots of jump’n’back among callbacks).
NodeJS parser claims you can pause it at any time, but its API doesn’t seem that reliable. You need to resort to ugly hacks if you want to properly support HTTP pipelining with an algorithm that doesn’t “go back”. If you decide not to stop the parser, you need to store all intermediate results while NodeJS parser refuses to give control back to you, which forces you to allocate (even if NodeJS parser doesn’t).
NodeJS parser is hard to use. Not only does the callback model force me to go back and forth here and there, it also forces me to resort to ugly hacks full of unsafe casts, which also increase object size, to provide a generic templated interface. Did I mention that it always consumes the current data, so I need to keep appending data everywhere (this is good and bad; the new parser I implemented does a good job at that)? The logic to handle field names is even more complex because of this appending. It’s more complex still because NodeJS parser won’t always decode the tokens (matching and decoding are separate steps) and you need to decode them yourself (and you need to know a lot of HTTP details).
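To make the “keep appending” problem concrete, here is a minimal sketch of what callback-style code tends to look like. It does not use the real NodeJS parser: the callback signatures, `connection_state`, and the `last_was_value` flag are all hypothetical stand-ins, shaped loosely like a C data callback that receives an untyped user pointer and a piece of the input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Per-connection state that the callbacks can only reach through void*.
struct connection_state {
    std::vector<std::pair<std::string, std::string>> headers;
    bool last_was_value = true; // true: the next name piece starts a new header
};

// Hypothetical callback: may be invoked several times for ONE field name
// when the name is split across input chunks.
int on_header_field(void *user, const char *at, std::size_t length) {
    assert(user != nullptr);
    connection_state *s = static_cast<connection_state *>(user); // unchecked cast
    if (s->last_was_value) {
        s->headers.emplace_back();
        s->last_was_value = false;
    }
    s->headers.back().first.append(at, length); // accumulate; may allocate
    return 0;
}

// Hypothetical callback: same story for the field value.
int on_header_value(void *user, const char *at, std::size_t length) {
    assert(user != nullptr);
    connection_state *s = static_cast<connection_state *>(user);
    if (s->headers.empty())
        return 1; // a value without a name: report an error
    s->headers.back().second.append(at, length);
    s->last_was_value = true;
    return 0;
}
```

Feeding "Conte" and then "nt-Length" to `on_header_field` yields a single "Content-Length" entry, but only because the user code tracks the name/value alternation itself and copies every piece into its own storage.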
The old parser is so hard to use that I wouldn’t dare to use the same tricks I’ve used in the new Boost.Http to avoid allocations on the old Boost.Http. So the NodeJS parser doesn’t allocate, but dealing with it (old Boost.Http) is so hard that you don’t want to reuse the buffer to keep incomplete tokens at all (forcing allocation or a big-enough secondary buffer to hold them in old Boost.Http).
HTTP upgrade is also very tricky and the lack of documentation for the NodeJS parser is depressing. So I only trust my own code as a usage reference for NodeJS parser.
However, I’ve hidden all this complexity from my users. My users wanted a different parser because they wanted a header-only library. I personally only wanted to change the parser because the NodeJS parser only accepts a limited set of HTTP methods and it was tricky to avoid allocations entirely. The new parser even makes it easier to reject an HTTP element before decoding it (e.g. a URL that is too long will exhaust the buffer, and then the new Boost.Http can just check the `expected_token` function to know it should reply with a 414 status code instead of concatenating a lot of URL pieces until it detects that the limit was reached).
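The 414 case can be sketched with a toy reader. This is not the Boost.Http parser: `request_line_reader`, `token`, `feed` and `status_for_full_buffer` are hypothetical names made up for illustration, and only the name `expected_token` is borrowed from the text above (the real function’s signature may differ). The point is just that, once the fixed buffer is full and no token was produced, asking which token is expected is enough to pick the reply.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical token kinds for a toy request-line reader.
enum class token { method, request_target, version, done };

// Toy stand-in for a pull parser: it only tracks which token comes next.
class request_line_reader {
public:
    token expected_token() const { return expected_; }

    // Consume every complete token found in `buf`; return the bytes consumed.
    std::size_t feed(std::string_view buf) {
        std::size_t consumed = 0;
        while (expected_ != token::done) {
            const char *delim = (expected_ == token::version) ? "\r\n" : " ";
            std::size_t end = buf.find(delim, consumed);
            if (end == std::string_view::npos)
                break; // current token is incomplete: more data is needed
            consumed = end + std::string_view(delim).size();
            expected_ = static_cast<token>(static_cast<int>(expected_) + 1);
        }
        assert(consumed <= buf.size());
        return consumed;
    }

private:
    token expected_ = token::method;
};

// Called when the fixed-size buffer is full and feed() made no progress.
int status_for_full_buffer(const request_line_reader &r) {
    switch (r.expected_token()) {
    case token::request_target:
        return 414; // URI Too Long
    default:
        return 400; // Bad Request (a simplifying assumption of this sketch)
    }
}
```

No URL piece is ever copied or concatenated here; the decision comes from the parser state alone.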
If you aren’t familiar enough with HTTP details, you cannot assume the NodeJS parser will abstract HTTP framing. Your code will get the wrong result and it’ll go silent for a long time before you know it.
The new parser
EDIT(2016/08/09): The new parser is almost ready. It can be used to parse request messages (it’ll be able to parse response messages soon). It’s written in C++03. It’s header-only. It only depends on boost::string_ref, boost::asio::const_buffer and a few others that I may be missing from memory right now. The new parser doesn’t allocate data and returns control to the user as soon as one token is ready or an error is reached. You can mutate the buffer while the parser maintains a reference to it. And the parser will decode the tokens, so you do not need ugly hacks as NodeJS parser requires (removing OWS from the end of header field values).
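For readers unfamiliar with the OWS issue, this is the kind of decoding step that a matching-only parser leaves to its caller. The helper names below are hypothetical, and the sketch uses `std::string_view` (C++17) for brevity instead of `boost::string_ref`; it is not code taken from the new parser.

```cpp
#include <cassert>
#include <string_view>

// OWS ("optional whitespace") in RFC 7230 is any run of spaces and tabs.
inline bool is_ows(char c) { return c == ' ' || c == '\t'; }

// Return the field value without surrounding OWS. The result is a view into
// the same buffer, so nothing is copied or allocated.
inline std::string_view trim_ows(std::string_view raw) {
    while (!raw.empty() && is_ows(raw.front()))
        raw.remove_prefix(1);
    while (!raw.empty() && is_ows(raw.back()))
        raw.remove_suffix(1);
    return raw;
}
```

A parser that decodes tokens hands the user the already trimmed view, so user code never needs to know this rule.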
I want to tell you that the new parser was NOT designed around Boost.Http’s needs. I wanted to make a general parser and the design started from there. Then I wanted to replace NodeJS parser within Boost.Http and the parts fit nicely. The only part that didn’t fit perfectly when it was time to integrate the pieces was a missing end_of_body token that was easy to add in the new parser code. This was the only time that I, as the author of Boost.Http and as a user of the new parser, used my power, as the author of the parser itself, to push my needs on everybody else. And this token was a nice addition anyway (using NodeJS API you’d use http_body_is_final).
My mentor Bjorn Reese had the vision to use an incremental parser much earlier than me. I was only convinced of the power of incremental parsers when I saw a CommonMark parser implemented in Rust. It convinced me immediately. It was very effective. Then I borrowed several conventions on how to represent tokens in C++ from one of Bjorn’s experiments.
There is also the “parser combinators” part of this project (still not ready) that I’ve only understood once I’ve watched a talk from Scott Wlaschin. Initially I was having a lot of trouble because I wanted stateful miniparsers to avoid “reparsing” certain parts, but you rarely read 1-sized chunks and I was only complicating things. The combinators part is tricky to deliver, because the next expected token will depend on the value (semantic, not syntax) of current token and this is hard to represent using expressions like Boost.Spirit abstractions. Therefore, I’m only going to deliver the mini-parsers, not the combinators. Feel free to give me feedback/ideas if you want to.
Needless to say, the new parser should have the same great features as NodeJS parser, like no allocations or syscalls behind the scenes. But it was actually easier to avoid and decrease allocations in Boost.Http thanks to the parser’s design of not forcing the user to accumulate values in separate buffers and of making offsets easy to obtain.
I could probably achieve the same reduction of buffers in Boost.Http with NodeJS parser, but it was quite hard to work with NodeJS parser (read the section above). And you should know that the parser-related code in old Boost.Http was almost 3 times bigger than the parser-related code in new Boost.Http (it’d be almost 4 times bigger, but I had to add code to detect the keep-alive property because the new parser only cares about message framing).
On the topic of performance, the new Boost.Http tests consume 7% more time to finish (using a CMake Release build with GCC on my machine). I haven’t spent time trying to improve performance and I think I’ll only try to improve memory usage anyway (the size of the parser structure).
A drawback (is it?) is that the new parser only cares about structuring the HTTP stream. It doesn’t care about connection state (exception: receiving an HTTP/1.0 response body/connection close event). Therefore, you need to implement keep-alive yourself (which the Boost.Http higher-level layers do).
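As a rough idea of what “implement keep-alive yourself” involves, here is a minimal sketch that follows the persistence rules of RFC 7230, section 6.3. It is not the code used in Boost.Http. The function names are hypothetical, the Connection header is assumed to be already split into trimmed options, and the special handling that RFC 7230 asks of proxies is ignored.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>
#include <vector>

// ASCII case-insensitive comparison (connection options are case-insensitive).
inline bool iequals(std::string_view a, std::string_view b) {
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i != a.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(a[i])) !=
            std::tolower(static_cast<unsigned char>(b[i])))
            return false;
    }
    return true;
}

inline bool has_option(const std::vector<std::string_view> &options,
                       std::string_view name) {
    for (std::string_view o : options)
        if (iequals(o, name))
            return true;
    return false;
}

// minor_version: 0 for HTTP/1.0, 1 for HTTP/1.1.
// connection_options: the already split values of the Connection header.
inline bool is_persistent(int minor_version,
                          const std::vector<std::string_view> &connection_options) {
    assert(minor_version >= 0);
    if (has_option(connection_options, "close"))
        return false; // "close" always wins
    if (minor_version >= 1)
        return true;  // HTTP/1.1 connections persist by default
    return has_option(connection_options, "keep-alive"); // HTTP/1.0 opt-in
}
```

The parser only has to expose the version and the header values; a decision like this one lives in the layer above it.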
I want to emphasize that the authors of the NodeJS parser have done a wonderful job with what they had at hand: C!
Migrating code to use the new parser
First, I haven’t added the code to parse the status line yet, so the parser is limited to HTTP requests. It shouldn’t take long (a few weeks until I finish this and several other tasks).
When you’re ready to upgrade, use the history of the Boost.Http project (files include/boost/http/socket-inl.hpp and include/boost/http/socket.hpp) as a guide. If you’ve been using NodeJS parser improperly, it’s very likely your code didn’t have as many lines as Boost.Http had. And your code probably isn’t as templated as Boost.Http anyway, so it’s very likely you didn’t need as many tricks with http_parser_settings as Boost.Http needed.
Tufão project has been using NodeJS parser improperly for ages and it’d be hard to fix that. Therefore, Tufão 1.4.0 has been refactored to use this new parser. It finally received support for HTTP pipelining, and plenty of bugfixes that nobody noticed landed. Unfortunately I got the semantics for HTTP upgrade within Tufão wrong and it kind of has “forced HTTP upgrade” (this is something I got right in Boost.Http thanks to RFC7230 clarification).
I may have convinced you to prefer Boost.Http parser over NodeJS parser when it comes to C++ projects. However, I hope to land a few improvements before calling it ready.
API/design-wise I hope to finish miniparsers for individual HTTP ABNF rules.
Test-wise, I can already tell you that more than 80% of all the code written for this parser is tests (about 4 lines of test for each line of implementation). However, I haven’t run the tests in combination with sanitizers (yet!) and there are a few more areas where tests can be improved (include coverage, allocate buffer chunks separately so sanitizers can detect invalid access attempts, fuzzers…) and I’ll work on them as well.
I can add some code to deduce the optimum size for indexes and return a parser with a little less overhead memory-wise.
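One way this deduction could work is to pick the index type at compile time from the largest buffer the user intends to hand to the parser. The sketch below is only an assumption about how it might look; `index_for` and `compact_parser_state` are hypothetical names, and it uses C++14 facilities rather than the C++03 the parser is written in.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>

// Smallest unsigned integer type able to hold any offset in [0, MaxBufferSize].
template <std::size_t MaxBufferSize>
using index_for = std::conditional_t<
    (MaxBufferSize <= std::numeric_limits<std::uint8_t>::max()), std::uint8_t,
    std::conditional_t<
        (MaxBufferSize <= std::numeric_limits<std::uint16_t>::max()), std::uint16_t,
        std::conditional_t<
            (MaxBufferSize <= std::numeric_limits<std::uint32_t>::max()),
            std::uint32_t, std::uint64_t>>>;

// Hypothetical parser bookkeeping that stores offsets into the user's buffer.
template <std::size_t MaxBufferSize>
struct compact_parser_state {
    using index_type = index_for<MaxBufferSize>;
    index_type token_begin = 0; // where the current token starts
    index_type token_size = 0;  // how many bytes it spans
};

static_assert(std::is_same<index_for<200>, std::uint8_t>::value,
              "a 200-byte buffer only needs 8-bit offsets");
static_assert(std::is_same<index_for<4096>, std::uint16_t>::value,
              "a 4 KiB buffer only needs 16-bit offsets");
```

With a 4 KiB buffer the two offsets take 4 bytes in total, instead of the 16 bytes that two `std::size_t` members would take on a typical 64-bit platform.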
This is the proposal I’ve sent to get some funding to work on the project. It adds a lot of background on the project.
The newest update to Boost.Http is that I had a long meeting with Vinnie Falco about a possible collaboration and a few changes are going to happen.
What this means:
- A lot of work making changes so projects can hopefully be merged in the future.
- API will break again.
- The thing I wasn’t caring about, “HTTP/1.1 oriented interface”, will be provided on a higher-level than simply an HTTP parser. This is what Beast.HTTP already provides.
- Beast.HTTP already provides WebSocket support and HTTP client support.
- We’ll be using a new mailing list to coordinate further development and I invite you to join the mailing list if you’re interested in the future of any of these libraries or HTTP APIs in general.
I’ll develop an HTTP pull parser for Boost.Http during this summer.
The story starts with Boost not being selected for Google Summer of Code this year. I wanted more funding to spend time on Boost.Http and this was unfortunate.
I’d rather work on the request router, but I don’t have a strong design for a request router right now because I’m still experimenting. A weak design would translate into a weak proposal, so I decided to propose an HTTP parser.
An interesting HTTP library that carries some similarities with Boost.Http was announced on the Boost mailing list: Beast.
Uisleandro is working on a request router focused on ReST services for Boost.Http: https://github.com/BoostGSoC14/boost.http/compare/master…uisleandro:router1?expand=1.
Previously I informed that my library was about to be reviewed by the Boost community. And some time ago, this review happened and a result was published.
The outcome of the Boost review process was that my library should not be included in Boost. I think I reacted to the review very well and that I defended the library design with good arguments.
There was valuable feedback that I gained through the review process. Feedback that I can use to improve the library. And improvements to the library shouldn’t stop once the library is accepted into Boost, so I was expecting to spend more time on the library even after the review, as suggested by the presence of a roadmap chapter in the library documentation.
The biggest complaint about the library now was completeness. The library “hasn’t proven” that the API is right. The lack of higher-level building blocks contributed significantly to the lack of trust in the current API. Also, if such a library were to enter Boost, it should be complete, so new users would have a satisfying first impression and continue to use the library after the initial contact. I was worried about delivering a megazord library to be reviewed in just one review, but that’s what will happen next time I submit the library for review. At least I introduced several concepts to the readers already.
Things that I planned were forgotten when I created the documentation and I’ll have to improve documentation once again to ensure guarantees that I had planned already. Also, some neat ideas were given to improve library design and further documentation updates will be required. Documentation was also lacking in the area of tutorial. Truth be told, I’m not very skilled in writing tutorials. Hopefully, the higher-level API will help me to introduce the library to newbies. Also, I can include several tutorials in the library to improve its status.
There was an idea about a parser/generator (like 3-level instead of 2-level indirection) that will require me to think even more about the design. Even now, I haven’t thought enough about this design yet. One thing for sure is that I’ll have to expose an HTTP parser because that’s the only thing that matters for some users.
A few other minor complaints were raised that can be addressed easily.
If you are willing to discuss more about the library, I have recently created a gitter channel where discussions can happen.
And for now, I need to re-read all messages given for the review and register associated issues in the project’s issue tracker. I’d like to have something ready by January, but it’ll probably take 6 months to 1 year before I submit the library again. Also, the HTTP client library is something that will possibly delay the library a lot, as I’ll research power users like Firefox and Chromium to make sure that the library is feature-ready for everybody.
So much work that maybe I’ll submit the library as a project on GSoC again next year to gather some more funding.
Also, I’d like to use this space to spread two efforts that I intend to make once the library is accepted into Boost:
- A Rust “port” of the library. Actually, it won’t be a 1:1 port, as I intend to use Rust’s unique expressiveness to develop a library that feels like a library that was born with Rust in mind.
- An enhanced, non-slow and not resource-hungry implementation (maybe Rust, maybe C++) of the trsst project.
“And for now, I need to re-read all messages given for the review and register associated issues in the project’s issue tracker.”
“I’d like to have something ready by January, but it’ll probably take 6 months to 1 year before I submit the library again.”
I think I was being too optimistic when I commented “6 months”. It’d only be possible to complete it within 6 months if Boost.Http were the only project I was developing. Of course I have a job and university, and the split focus wouldn’t allow me to finish the library in such a small amount of time.
In February last year, I was writing an email to the Boost user mailing list, asking for feedback on an HTTP server library proposal. Later that year, I got the news that I was to receive funds through Google to work on that proposal. And I delivered a library within the requested timeframe. However, getting the library into Boost is another game. You must pass through the review process if you want to integrate a library into Boost. And I’d like to announce that my library is finally going to be reviewed starting on Friday.
After working on this project for more than a year, I’m pretty glad it finally reached this milestone. And I’m very confident about the review.
I had written about this project so much here and there that when the time to write about it in my own blog comes, I don’t have many words left. It’s a good thing to leave as much info as possible in the documentation itself and not spread info everywhere or try to lure users into my blog by providing little information in the documentation.
I created a new category on this blog to track the progress, so you’ll be able to have a separate rss feed for these posts. The new category URL is https://vinipsmaker.wordpress.com/category/computacao/gsoc2014-boost/.