[neomutt-devel] Header Cache invalidation

Fri Sep 8 21:56:38 CEST 2017

> On 8 Sep 2017, at 11:17, Elimar Riesebieter <riesebie at lxtec.de> wrote:
> 
> * Richard Russon <rich at flatcap.org> [2017-09-07 22:37 +0000]:
> 
>> Elimar wrote on the user mailing list:
>>> Running 2017-09-07 I have to rebuild hcache
>>> I had to rebuild hcache in ... almost every new version
>> 
>> You're right and unfortunately there's not much we can do about it.
>> 
>> We've been making a lot of structural changes to the code.  These should
>> make it easier to maintain and test.
>> 
>> The header cache works like this (we think :-)
>> 
>> The email's header is parsed and lots of objects are created, which store
>> the details:
>> 
>> * struct Address
>> * struct Parameter
>> * struct Body
>> * struct Envelope
>> * struct Header
>> 
>> It's these objects which are stored in the database (they are serialised).
>> If any of the objects are changed, then the database values can't be used.
>> 
>> Clearly, we need to make sure that the structs we used to save the values to
>> the database are identical to those we read into.  Mutt's build calculated
>> the md5sum of the relevant header files.  This is done by hcache/hcachever.sh
>> 
>> If this checksum changes, then the existing header cache is invalid.
>> 
>> We would like to replace this caching with something more reliable, but that
>> would mean someone has to, first, understand it fully.
> 
> Many thanks for your efforts. It is not a big task to rebuild the
> hcache. It takes about 10 minutes per each machine. But I think it
> is a big task to rewrite the code. And hey, the neomutt dev's are
> doing a great job ;-)

One possible approach to mitigate this problem would be to introduce versioning in the serialized format and make sure changes to the source structures are made in a backwards-compatible way.
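
To make that a bit more concrete, here is a minimal sketch in C of what a versioned record could look like. Everything in it is made up for illustration: struct CachedEntry, its two fields, the one-byte version tag and the function names are assumptions of mine, not the current hcache layout. The point is only that a reader which knows about version 2 can still accept a version 1 record by defaulting the newer field, and can reject anything it doesn't know so the caller re-parses the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached record, NOT one of NeoMutt's real structs. */
struct CachedEntry
{
  uint32_t date_sent; /* present since format version 1 */
  uint32_t lines;     /* added in format version 2 */
};

enum { ENTRY_FORMAT_VERSION = 2 };

/* Store/load a 32-bit value in big-endian order, independent of the host. */
static void put_u32(unsigned char *buf, uint32_t v)
{
  buf[0] = (unsigned char) (v >> 24);
  buf[1] = (unsigned char) (v >> 16);
  buf[2] = (unsigned char) (v >> 8);
  buf[3] = (unsigned char) v;
}

static uint32_t get_u32(const unsigned char *buf)
{
  return ((uint32_t) buf[0] << 24) | ((uint32_t) buf[1] << 16) |
         ((uint32_t) buf[2] << 8) | (uint32_t) buf[3];
}

/* Layout: [version: 1 byte][date_sent: 4 bytes][lines: 4 bytes].
 * Returns the number of bytes written, or 0 if buf is too small. */
size_t entry_serialize(const struct CachedEntry *e, unsigned char *buf, size_t buflen)
{
  assert(e && buf);
  if (buflen < 9)
    return 0;
  buf[0] = ENTRY_FORMAT_VERSION;
  put_u32(buf + 1, e->date_sent);
  put_u32(buf + 5, e->lines);
  return 9;
}

/* Returns 0 on success, -1 if the record is truncated or its version is
 * unknown (the caller would then re-parse the message and rewrite the entry). */
int entry_deserialize(const unsigned char *buf, size_t len, struct CachedEntry *e)
{
  assert(buf && e);
  if (len < 5)
    return -1;
  uint8_t version = buf[0];
  if ((version < 1) || (version > ENTRY_FORMAT_VERSION))
    return -1;
  e->date_sent = get_u32(buf + 1);
  e->lines = 0; /* default for records written before version 2 */
  if (version >= 2)
  {
    if (len < 9)
      return -1;
    e->lines = get_u32(buf + 5);
  }
  return 0;
}
```

With something along these lines, adding a field would invalidate nothing: old records keep loading, and only records from a newer, unknown version are thrown away.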

There are a number of readily available solutions out there, including Avro and Protocol Buffers. We could adopt just the wire format and write our own (de)serializers, or use both the format and the existing libraries.

There are also a number of issues to be taken into consideration, including:

* locality of the structure and the (de)serialization code: we want to make it as hard as possible for a programmer to forget to update the (de)serializers when updating the structures. This is a problem that we have already. Currently, a mismatch causes a crash in the best case and data corruption in the worst case. Ideally, the checks would be done at compile time and trigger a compiler error if the two don't match. This is something C++'s metaprogramming facilities would be great at, but don't let me go there just yet ;)

* dependencies: we probably don't want to depend on yet another third-party library.

* performance: we still want the (de)serialization to take a non-noticeable amount of time.
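
On the locality point, plain C can get part of the way there with the X-macro idiom: the field list is written once, and the struct and both (de)serializers are generated from it, so they cannot drift apart. The sketch below is only that, a sketch: ENVELOPE_FIELDS, struct CachedEnvelope and the function names are hypothetical, and it assumes every field is a uint32_t, which the real structs certainly are not. The static_assert is an extra guard that fails the build if someone adds a member to the struct by hand instead of going through the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical field list: each field is named exactly once, here. */
#define ENVELOPE_FIELDS(X) \
  X(date_sent)             \
  X(lines)                 \
  X(content_length)

/* Expand the list into the struct members. */
struct CachedEnvelope
{
#define X(name) uint32_t name;
  ENVELOPE_FIELDS(X)
#undef X
};

/* Expand the list into "0 +1 +1 +1" to count the fields. */
enum
{
#define X(name) +1
  ENVELOPE_FIELD_COUNT = 0 ENVELOPE_FIELDS(X)
#undef X
};

/* Compile-time check: a member added outside ENVELOPE_FIELDS changes the
 * struct size and breaks the build. */
static_assert(sizeof(struct CachedEnvelope) == ENVELOPE_FIELD_COUNT * sizeof(uint32_t),
              "struct CachedEnvelope has members missing from ENVELOPE_FIELDS");

/* Copies every listed field into out[], which must hold
 * ENVELOPE_FIELD_COUNT words. Returns the number of words written. */
size_t envelope_serialize(const struct CachedEnvelope *e, uint32_t *out)
{
  size_t n = 0;
#define X(name) out[n++] = e->name;
  ENVELOPE_FIELDS(X)
#undef X
  return n;
}

/* Inverse of envelope_serialize(): reads ENVELOPE_FIELD_COUNT words from
 * in[] back into the struct. Returns the number of words read. */
size_t envelope_deserialize(const uint32_t *in, struct CachedEnvelope *e)
{
  size_t n = 0;
#define X(name) e->name = in[n++];
  ENVELOPE_FIELDS(X)
#undef X
  return n;
}
```

Adding a field then means adding one line to ENVELOPE_FIELDS, and both directions pick it up. The price is that the macros are not exactly pleasant to read or to step through in a debugger.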

Just food for thought, for now. This is something I would like to investigate further in the future, if there's consensus that this is something we want to have.

I'm not sure the benefit would justify the cost, though.

-- 
Pietro Cerutti
gahr at gahr.ch