I usually don’t blog about updates on the Boost.Http project because I want as much info as possible to live in code and documentation (or even git history), not here. However, I got a stimulus to change this habit. A new parser I’ve been writing has replaced the NodeJS parser in Boost.Http, and here is the most appropriate place to announce the change. This info will also be useful if you’re interested in using the NodeJS parser, any HTTP parser, or even in designing a parser with a stable API, HTTP-related or not.
EDIT (2016/08/07): I tried to clarify the text. Now I try to make it clear everywhere in the text whether I’m referring to the new parser (the one I wrote) or the old parser (the NodeJS parser). I’ll also refer to Boost.Http with the new parser as the new Boost.Http and to Boost.Http with the old parser as the old Boost.Http.
What’s wrong with NodeJS parser?
I started developing a new parser because several users wanted a header-only library and the parser was the main barrier to that in my library. I took the opportunity to also provide a better interface, one that isn’t limited to the C language (which is inconvenient and requires lots of unsafe type casts) and uses a style that doesn’t own the control flow (easier to deal with HTTP upgrade and no need for lots of jumping back and forth among callbacks).
The NodeJS parser claims you can pause it at any time, but its API doesn’t seem that reliable. You need to resort to ugly hacks if you want to properly support HTTP pipelining with an algorithm that doesn’t “go back”. If you decide not to stop the parser, you need to store all intermediate results while the NodeJS parser refuses to give control back to you, which forces you to allocate (even if the NodeJS parser doesn’t).
The NodeJS parser is hard to use. Not only does the callback model force me to go back and forth between callbacks, it also forces me to resort to ugly hacks full of unsafe casts (which also increase object size) to provide a generic templated interface. Did I mention that it always consumes the current data, so I need to keep appending data everywhere (this is good and bad; the new parser I implemented does a good job here)? The logic to handle field names is even more complex because of this appending. And it gets worse: the NodeJS parser won’t always decode the tokens (matching and decoding are separate steps), so you need to decode them yourself (and you need to know a lot of HTTP details to do so).
The old parser is so hard to use that I wouldn’t dare to apply the same tricks I’ve used in the new Boost.Http to avoid allocations to the old Boost.Http. So the NodeJS parser doesn’t allocate, but dealing with it is so hard that you don’t want to reuse the buffer to keep incomplete tokens at all (forcing an allocation, or a big-enough secondary buffer to hold them, in the old Boost.Http).
HTTP upgrade is also very tricky, and the lack of documentation for the NodeJS parser is depressing. So I only trust my own code as a usage reference for the NodeJS parser.
However, I’ve hidden all this complexity from my users. My users wanted a different parser because they wanted a header-only library. I personally only wanted to change the parser because the NodeJS parser accepts just a limited set of HTTP methods and it was tricky to properly avoid any allocation. The new parser even makes it easier to reject an HTTP element before decoding it (e.g. a URL that is too long will exhaust the buffer, and then the new Boost.Http can just check the `expected_token` function to know it should reply with a 414 status code, instead of concatenating a lot of URL pieces until it detects the limit was reached).
If you aren’t familiar enough with HTTP details, you cannot assume the NodeJS parser will abstract HTTP framing for you. Your code will get the wrong result and fail silently for a long time before you notice it.
The new parser
EDIT (2016/08/09): The new parser is almost ready. It can be used to parse request messages (it’ll be able to parse response messages soon). It’s written in C++03. It’s header-only. It only depends on boost::string_ref, boost::asio::const_buffer and a few others that I may be missing from memory right now. The new parser doesn’t allocate and returns control to the user as soon as one token is ready or an error is reached. You can mutate the buffer while the parser maintains a reference to it. And the parser will decode the tokens, so you don’t need the ugly hacks the NodeJS parser requires (like removing OWS from the end of header field values yourself).
I want to tell you that the new parser was NOT designed around Boost.Http’s needs. I wanted to make a general parser, and the design started from there. Then I wanted to replace the NodeJS parser within Boost.Http, and the parts fit nicely. The only piece that didn’t fit perfectly at integration time was a missing end_of_body token, which was easy to add to the new parser code. This was the only time that I, as the author of Boost.Http and a user of the new parser, used my power as the author of the parser itself to push my needs onto everybody else. And this token was a nice addition anyway (with the NodeJS API you’d use http_body_is_final).
My mentor, Bjorn Reese, had the vision to use an incremental parser much earlier than I did. I was only convinced of the power of incremental parsers when I saw a CommonMark parser implemented in Rust. It convinced me immediately. It was very effective. Then I borrowed several conventions on how to represent tokens in C++ from one of Bjorn’s experiments.
There is also the “parser combinators” part of this project (still not ready), which I only understood once I watched a talk by Scott Wlaschin. Initially I was having a lot of trouble because I wanted stateful mini-parsers to avoid “reparsing” certain parts, but you rarely read 1-byte-sized chunks and I was only complicating things. The combinators part is tricky to deliver, because the next expected token depends on the value (semantics, not syntax) of the current token, and this is hard to represent using expressions like Boost.Spirit’s abstractions. Therefore, I’m only going to deliver the mini-parsers, not the combinators. Feel free to give me feedback/ideas if you want to.
Needless to say, the new parser has the same great properties as the NodeJS parser, like no allocations or syscalls behind the scenes. But it was actually easier to avoid and decrease allocations in Boost.Http thanks to the parser’s design: it doesn’t force the user to accumulate values in separate buffers, and it makes offsets easy to obtain.
I probably could have achieved the same reduction in buffers in Boost.Http with the NodeJS parser, but it was quite hard to work with (see the section above). And you should know that the parser-related code in the old Boost.Http was almost 3 times bigger than in the new Boost.Http (it’d be almost 4 times bigger, but I had to add code to detect the keep-alive property, because the new parser only cares about message framing).
On the topic of performance, the new Boost.Http tests take 7% more time to finish (using a CMake Release build with GCC on my machine). I haven’t spent time trying to improve performance, and I think I’ll only try to improve memory usage anyway (the size of the parser structure).
A drawback (is it?) is that the new parser only cares about structuring the HTTP stream. It doesn’t care about connection state (exception: receiving an HTTP 1.0 response body/connection close event). Therefore, you need to implement keep-alive yourself (which the higher-level layers of Boost.Http do).
I want to emphasize that the authors of the NodeJS parser have done a wonderful job with what they had in hands: C!
Migrating code to use the new parser
First, I haven’t added the code to parse the status line yet, so the parser is limited to HTTP requests. It shouldn’t take long (a few weeks until I finish this and several other tasks).
When you’re ready to upgrade, use the history of the Boost.Http project (files include/boost/http/socket-inl.hpp and include/boost/http/socket.hpp) as a guide. If you’ve been using the NodeJS parser improperly, it’s very likely your code didn’t have as many lines as Boost.Http had. And your code probably isn’t as templated as Boost.Http anyway, so it’s very likely you didn’t need as many tricks with http_parser_settings as Boost.Http needed.
The Tufão project had been using the NodeJS parser improperly for ages and it’d be hard to fix that. Therefore, Tufão 1.4.0 has been refactored to use this new parser. It finally received support for HTTP pipelining, and plenty of bugfixes that nobody noticed landed. Unfortunately, I got the semantics for HTTP upgrade within Tufão wrong and it kind of has a “forced HTTP upgrade” (this is something I got right in Boost.Http thanks to RFC 7230’s clarifications).
I may have convinced you to prefer the Boost.Http parser over the NodeJS parser when it comes to C++ projects. However, I hope to land a few improvements before calling it ready.
API/design-wise I hope to finish miniparsers for individual HTTP ABNF rules.
Test-wise, I can already tell you that more than 80% of all the code written for this parser is tests (about 4 lines of test for each line of implementation). However, I haven’t run the tests in combination with sanitizers (yet!), and there are a few more areas where the tests can be improved (better coverage, allocating buffer chunks separately so sanitizers can detect invalid access attempts, fuzzers…), and I’ll work on them as well.
I can add some code to deduce the optimum size for indexes and return a parser with a little less overhead memory-wise.
This is the proposal I sent to get some funding to work on the project. It adds a lot of background on the project.
There is a free software project I started called Tufão. The project’s goal was to make the C++ language friendly for web development. The thing is, for a long time web development did anything but attract me, and that only changed after I got to know Node.js, which ended up influencing Tufão’s architecture. Many, many months ago, Tufão was hosted on Google Code, but for a few reasons I ended up migrating to GitHub.
Reasons for the move
I migrated Tufão to GitHub for the simple reason that the markup language used to customize a project’s home page on Google Code doesn’t support nested lists very well.
That reason may sound disappointing, so I’ll add that another reason for the migration is that I could finally put the project’s documentation online, since GitHub generates a website for you from the special gh-pages branch, and online documentation was something I really wanted. The attempt to convert the Doxygen-generated documentation to the Google Code wiki produced a pretty bad result. Back when I used Google Code, I ended up offering the generated documentation as a download, which was quite inconvenient. And that workaround wouldn’t even work today, because “due to misuse of the feature” Google disabled it.
The first impact GitHub had on the project: in the old days there was no easy way to offer the project’s binaries for download, so I stopped offering binaries for the Windows platform (from now on you have to build them yourself), and users stopped nagging me about a task with a potential combinatorial explosion, since development environments can vary a lot between systems.
The second impact GitHub had: it maps the distributed nature of git very well, and contributing to the project became very easy. You can see on GitHub itself that there are people modifying their own copies of the project, and there have been contributors other than me.
The third impact GitHub had was getting me addicted to Markdown. It’s requirement #1 for text boxes in any web service. I use secret gists to keep my to-do lists because they support Markdown, I write my proposals in Markdown, I make presentation slides in Markdown, I use and abuse Markdown…
One thing I noticed is that the Google Code bug tracker seemed more complete, but for a project of Tufão’s size this hasn’t hurt the project yet. Just to give an example: on Google Code I could attach arbitrary files to comments I made on reported bugs; on GitHub I’ve only learned how to attach images. I think even the Launchpad bug tracker is more complete, and I don’t think GitHub will change, because that simplicity makes their service more “user-friendly”, which is the same excuse they give for ignoring Linus Torvalds’ complaints.
Since Tufão 0.4, I’ve been using CMake as the Tufão build system, but occasionally I see some users reimplementing the qmake-based project files, and I thought it’d be a good idea to explain/document why such a rewrite is a bad idea. This is that post.
Simple and clear (1 reason)
It means *nothing* to your qmake-based project.
What your qmake-based project needs is a *.pri file for you to include in your *.pro file. And such a *.pri file *is* generated (and properly included in your Qt installation) by Tufão. You’ll just write the usual “CONFIG += TUFAO” without *any* pain.
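Assuming the generated *.pri is installed as a qmake feature, a consuming project file would look something like this (the project name and source list are placeholders):

```
# app.pro -- hypothetical consumer project
TEMPLATE = app
SOURCES += main.cpp
CONFIG  += TUFAO   # pulls in the Tufão-generated *.pri
```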
Why won’t I use qmake in Tufão ever again (long answer)
Two reasons why it’s a bad idea:
- You can define only one target per file, so you need subdirs, and that’s hard.
  - Tufão’s unit testing, based on Qt Test, requires a separate executable per test, and the “src/tests/CMakeLists.txt” CMake file beautifully defines 13 tests. With the CMake-based system, all you need to do to add a new test is add a single line to that file. QMake is so painful that I’d have defined a dumb test system that only works after you install the lib, just to free myself from the qmake pain.
- There is no easy way to preprocess files.
  - And if you use external commands that you don’t provide yourself, like grep, sed or whatever, then your build system will be less portable than autotools. Not every Windows developer likes autotools, and your approach (external commands) won’t be any better.

All in all, it becomes hard to write a build system that installs the files required by projects that use QMake, CMake or pkg-config (Tufão supports all three).
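The one-line-per-test setup mentioned above can be sketched in CMake like this (the test names and link targets are made up; this is not Tufão’s actual CMakeLists.txt):

```cmake
# Hypothetical sketch: each test is its own executable, and adding a
# new test means adding one name to this list.
set(tests_list headers querystring httpserverrequest)

foreach(test ${tests_list})
    add_executable(test_${test} ${test}.cpp)
    target_link_libraries(test_${test} Qt5::Test tufao)
    add_test(NAME ${test} COMMAND test_${test})
endforeach()
```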
The reasons above are the important ones, but there are others, like the fact that the documentation is almost as bad as the documentation for creating QtCreator plugins.
The ever growing distance from QMake
When Tufão grows, sometimes the build system becomes more complicated/demanding, and when that happens, I bet the QMake-based approach will become even more difficult to maintain. The most recent libtufao.pro that I’m aware of had to include some underdocumented black magic just to meet the Qt4/5 demands of Tufão 0.x.
You like a CMake-based Tufão
The current CMake-based build system of Tufão provides features that you certainly enjoy. At least the greater stability of the new unit testing framework requires CMake and you certainly want a stable library.
In the beginning, QMake met Tufão requirements of a build system, but I wouldn’t use it again for demanding projects.
But I don’t hate QMake and I’d use it again in a Qt-based project if, and only if, I *know* it won’t have demanding needs.
Of course I’ll hate QMake if people start to overuse it (and creating me trouble).
And if you still want to maintain a QMake-based Tufão project file, at least you’ve been warned about the *pain* and the inferior solution you’ll end up with.
Today I spent some minutes of my time fixing the metadata of the Tufão git repo. The issue made it difficult to tell which author owns which lines of code, which can be a problem if you want to give merit to the real coders (maybe this is related to ethics) and if you want to contact the authors later for actions that can only be done by the copyright owner (e.g. changing the license). This incorrect info was introduced by misuse of the git tool.
First, I must admit that I was wrong and the issue was entirely caused by my ignorance of git at the time. The issue was not made on purpose. You can see an example here, where I mentioned the original author in the commit message to give the appropriate credit (my intent). But there are also cases where I failed to even make such a mention.
Every commit you make using the git tool has an associated author, and if you don’t specify the author explicitly, git will use the global config. This metadata is used by commands such as git shortlog -s and git blame. The solution is simple: just set the author explicitly using the --author argument.
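A quick sketch of the fix in action (the names and emails are placeholders): the author recorded by --author is independent of the committer identity from the config.

```shell
# Record someone else's patch with the right author metadata.
repo="$(mktemp -d)"                  # throwaway repo for the demo
cd "$repo"
git init -q .
git config user.name "Committer"
git config user.email "committer@example.com"
echo 'hello' > file.txt
git add file.txt
git commit -q -m "Apply patch from Jane" \
    --author="Jane Doe <jane@example.com>"
git log -1 --format='%an %cn'        # prints: Jane Doe Committer
```

Both git shortlog -s and git blame will now attribute the lines to Jane Doe, while the committer field still records who applied the patch.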
Now the git repo history mentions authors such as Paul Maseberg and Marco Molteni.
Last, but not least, I want to let you know that I believe the problem is solved. If you find something I missed, just file a bug on GitHub and I’ll fix it.
I gave the talk about Tufão at FISL 14. I was a bit nervous and ended up omitting a lot of what I had initially planned to discuss, besides speaking a bit fast at the beginning of the presentation. Also, I had prepared the talk for beginners and it was a bit unexpected to find experienced people in the audience (I had to skip several explanations I had planned to give), but that made me happy whenever a “why” question arrived together with some argument. Anyway, here are the slides and the video for anyone interested. I’ll write about the rest of FISL in a separate, nicer post.
And I regretted not including a bonus image that broke some of the “good manners” rules that end up creating a barrier between the speaker and the audience. I’ve decided that all my talks, from now on, will have a fun image.
I hope to quickly go through the project’s history and use the remaining time for motivation, reserving 10 minutes at the end for questions. The motivation should include a short explanation of how the web works and its main points, comparisons, benchmarks (this one is harder), limitations, examples and a “tutorial”.
I should prepare a new example application that explores the features present in Tufão.
After a long time developing Tufão, it finally reached 1.0 version some hours ago. I’ve spent a lot of time cleaning up the API and exploring the features provided by C++11 and Qt5 to release this version.
This is the first Tufão release that:
- … breaks config files, so you’ll need to update your config files to use them with the new version
- … breaks ABI, so you’ll need to recompile your projects to use the new version
- … breaks API, so you’ll need to change your source code to be able to recompile your previous code and use the new version
- … breaks previously documented behaviour, so you’ll need to change the way you use features that were available before. But don’t worry, because the list of these changes is really small and they are well documented below.
Porting to Tufão 1.0
From now on, you should link against tufao1 instead of tufao. The pkg-config, qmake and CMake files were also renamed, so you can have different Tufão libraries on the same system if their major versions differ.
The list of behavioural changes is:
- Headers are now stored in a hash table, so you can’t easily predict (and shouldn’t rely on) the order of the headers anymore. I hope this change improves performance.
- HttpServerRequest::ready signal auto-disconnects all slots connected to the HttpServerRequest::data and HttpServerRequest::end signals before being emitted.
- HttpFileServer can automatically detect the MIME type of the served files, so if you had your own logic to handle MIME types, you can throw it away and enjoy a smaller code base to maintain.
The list of changes:
- The project finally has a logo (made by me in Inkscape)
- Deprecated API was removed
- Url and QueryString removed in favor of QUrl
- Headers refactored to inherit from QMultiHash instead of QMultiMap
- Constructor’s options argument is optional now
- setOptions method added
- Constructor takes a reference to a QIODevice instead of a pointer
- Constructor takes a reference to a QAbstractSocket instead of a pointer
- socket method returns a reference instead of a pointer
- url returns a QUrl
- data signal was changed and you must use readBody method to access body’s content
- the upgrade’s head data is accessed from the request body from now on
- now the object auto-disconnects slots from data and end signals right before emit ready
- setCustomData and customData methods added
- Now HttpServerRequestRouter uses these methods to pass the list of captured texts
- HttpServer uses reference instead of pointers in several places
- AbstractHttpServerRequestRouter refactored to explore lambdas features
- Tufão’s plugin system fully refactored
- It’s using JSON files as configuration
- It uses references instead of pointers
- It receives 2 arguments instead of 3
- One more abstraction to sessions created to explore lambdas
- startServerHandshake takes references instead of pointers
- LESS POINTERS and MORE REFERENCES
- This change exposes a more predictable and natural model
- I’m caring less about Qt style and more about C++ style
- But don’t worry, I’ll maintain a balance
- Using scoped enums
- HttpFileServer uses/sends mime info
- Interfaces don’t inherit from QObject anymore, so you can use multiple inheritance to make the same class implement many interfaces
- HttpUpgradeRouter introduced
- HttpServer::setUpgradeHandler also
- Updated QtCreator plugin to work with QtCreator 2.7.0 and Qt 5
I want to improve Tufão’s stability and performance, so now I’ll focus on a minor release with unit testing and some minor changes.
After Tufão 1.0.1, I’ll focus on BlowThemAll.
You can see a visualization video based on the history of commits below.
This release deserves a wallpaper, so I made one. See it below:
You can download the source of the previous wallpaper here.
What are you still doing here? Go download the new version right now!