Serialization choices

I’ve recently been looking into different serialization options. While there are plenty of writeups (even in C#) already available, I wanted to:

  • Have one about C#
  • Learn something new 😉
  • Look into my particular data distribution / characteristics
  • Understand not only performance, but also size impact

I’d say that there are very few summaries you should just rely on. With some exceptions, serialization is generally pretty fast, and the choice you make also comes with the following considerations:

  • do you need it to be ‘human readable’?
    • Is it just enough to have a tool that presents the serialization for you?
    • Do you need tagged or untagged serialization? Untagged serialization can be noticeably faster and smaller.
      What if even the field names aren’t preserved in serialization? If you don’t have a schema agreed upon (ahead of time), some consumers of your data might not know how to interpret it (e.g. visualization tools for data).
  • do you care about maximum compactness of data representation?
  • how much do you really care about performance? if you are going to send it ‘over the wire’, chances are that any of the serializers will be ‘fast enough’.
  • do you need cross language support? do you need code-gen for those languages?

In general, picking a serializer is picking your personal favorite. TLDR: don’t use XML. If you need something human readable, use JSON; it’s easy to deal with when schemaless (although dates can pose some challenges). In C#, avoid JavascriptSerializer, DataContractSerializer and DataContractJsonSerializer if you can; they are on the slow side.

Goals

I want to determine the time and size impact of different serializers. I am more concerned about size, assuming that time will be about the same for the majority of serializers. The goal of this post is not to document the differences between serialization stacks (analysis of languages, APIs, etc.).

Serializers in the set

Almost all of the serializers on this list support some kind of RPC on their own; I’ll skip that part in this analysis.

  • Newtonsoft.Json – a good JSON serializer. It supports ‘DataContract’ attributes from System.Runtime.Serialization, and it also supports BSON. (todo: BSON) A minimal usage sketch follows this list.
  • Thrift – a fully fledged, cross-language/platform ‘RPC’ library (service development), used (among others) by Twitter and Salesforce (also, combined with Finagle). It has a language-agnostic data (and service) definition layer, which then transpiles to a specific language of your choice. Various serialization options are available.
  • Avro – a data serialization system that also provides an RPC layer if needed. It relies on schemas (but the schema is embedded with the message/data), and code does not have to be generated (unlike Thrift and Protobuf).
  • Bond – Microsoft’s (cross-platform) serialization mechanism.
  • ProtoBuf(fers) (github) – very similar to Thrift; see ‘Protobuf vs Thrift’ on Stack Overflow (Thrift is more RPC-oriented).
  • MessagePack – can work without a schema.
  • JsonDataContract, JavascriptSerializer – built-in C# serializers.
  • DataContractSerializer (XML) – built-in .NET serializer.
  • Binary formatter – the built-in binary formatter/serializer for .NET. You’d probably never use it in production, as it doesn’t really provide any backward compatibility in case of schema changes.
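As mentioned in the Newtonsoft.Json bullet above, here is a minimal sketch of Json.NET honoring DataContract attributes (the type and property names are illustrative, not taken from the benchmark project):

```csharp
using System.Runtime.Serialization;
using Newtonsoft.Json;

[DataContract]
public class BasicDto
{
    [DataMember(Name = "url")]
    public string Url { get; set; }

    public string Scratch { get; set; } // no [DataMember], so Json.NET skips it (opt-in mode)
}

// JsonConvert.SerializeObject(new BasicDto { Url = "https://example.com" })
// produces: {"url":"https://example.com"}
```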

About the project

I used Benchmark.Net for the performance experiments. While it puts some constraints on code layout, it not only measures performance (with proper warm-up and repeated runs), it also measures approximate allocations and garbage collections.
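As a rough sketch (the type names here are illustrative, not the actual classes from the repo), a benchmark class looks like this; the MemoryDiagnoser is what produces the allocation columns shown later:

```csharp
using BenchmarkDotNet.Attributes;
using Newtonsoft.Json;

[MemoryDiagnoser] // adds the Gen 0/1/2 and Allocated columns
public class SerializationBenchmarks
{
    // Test data and serializer settings are created once, outside the measured code.
    private readonly BasicDto _data = TestData.CreateBasic(); // hypothetical factory
    private readonly JsonSerializerSettings _settings = new JsonSerializerSettings();

    [Benchmark]
    public string NewtonsoftJson() => JsonConvert.SerializeObject(_data, _settings);
}

// Entry point: BenchmarkDotNet.Running.BenchmarkRunner.Run<SerializationBenchmarks>();
```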

If a serializer cannot consume the standard data type, I use AutoMapper to map from my original type to that serializer’s type. Since some serializers don’t handle nulls, at some point I decided not to have nulls in my properties.
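For illustration, the mapping step looks roughly like this (ThriftBasicDto is a placeholder name for a generated, serializer-specific type):

```csharp
using AutoMapper;

// Configure a one-way map from the canonical test type to the
// serializer-specific type; this is done once, before the tests run.
var config = new MapperConfiguration(cfg => cfg.CreateMap<BasicDto, ThriftBasicDto>());
var mapper = config.CreateMapper();

// Then, per test object:
ThriftBasicDto thriftDto = mapper.Map<ThriftBasicDto>(originalDto);
```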

You can find the sources on GitHub.

The test data

I decided to test two scenarios. One is an object that contains 4 strings. The other contains binary data (in my case the ‘binary’ is HTML) and is meant to represent content fetched by a web fetcher. So the data contains a Url (as a string), a Response Header (as text, since it’s ANSI), and Content (byte[]), since the fetcher itself might not know what encoding to apply.
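A minimal sketch of that shape (the property names are illustrative, not necessarily those in the repo):

```csharp
// The 'binary' scenario: a fetched web page whose body encoding is unknown
// at fetch time, so the content stays a raw byte[].
public class FetchedPage
{
    public string Url { get; set; }
    public string ResponseHeader { get; set; } // headers are ANSI, so text is fine
    public byte[] Content { get; set; }        // raw HTML bytes in these tests
}
```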

I generated two objects which I use across all tests. Both objects are generated before the tests are performed. All serializer instances are also created before the tests begin; I assume one-time creation cost is negligible (even if it’s one-time per type).

What I didn’t test is how these serializers handle nested objects, cycles, etc. While all of them work fine with nested objects, they differ in cycle handling, and some are configurable in that regard. Note that cycle (and reference) handling almost always has an additional performance impact, hence it was out of scope.
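As an example of that configurability, this is how Newtonsoft.Json can be told to tolerate cycles (a minimal, self-contained sketch; by default it throws on a reference loop):

```csharp
using Newtonsoft.Json;

class Node { public Node Next { get; set; } }

class CycleDemo
{
    static void Main()
    {
        var a = new Node();
        a.Next = new Node { Next = a }; // a reference cycle

        // Without settings, SerializeObject throws a JsonSerializationException here;
        // ReferenceLoopHandling.Ignore drops the looping reference instead.
        var settings = new JsonSerializerSettings
        {
            ReferenceLoopHandling = ReferenceLoopHandling.Ignore,
        };
        System.Console.WriteLine(JsonConvert.SerializeObject(a, settings)); // {"Next":{}}
    }
}
```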

Size & compressibility

One of the goals is to save size. It turns out that you cannot get much better than untagged serialization that supports binary arrays as a first-class citizen. Of course, there is an open question of how inheritance is implemented (if supported), but that’s outside the scope of this document.

Let’s take a look at the basic object (4 strings). The first column shows the uncompressed size; the second the size after compressing with DeflateStream. The third column shows each object’s size as a percentage of the largest one, and the fourth shows the compressed size as a percentage of the largest uncompressed object.

As we can see, all of the ‘top of the line’ serializers produce very similarly sized results (well, to be frank, with 4 strings there is not much rocket science in that).

| Serializer | Uncompressed bytes | Compressed (optimal) | % of max | % of max compressed |
|---|---:|---:|---:|---:|
| MessagePack | 710 | 457 | 71% | 46% |
| Avro | 708 | 457 | 71% | 46% |
| ThriftCompact | 713 | 460 | 71% | 46% |
| Proto3 | 712 | 463 | 71% | 46% |
| BondUnsafeSimpleCopied | 719 | 464 | 72% | 46% |
| BondUnsafeCompactReused | 717 | 469 | 72% | 47% |
| ThriftBinary | 732 | 478 | 73% | 48% |
| NewtonsoftJsonReusedSerializer | 774 | 488 | 77% | 49% |
| JavascriptSerializer | 774 | 488 | 77% | 49% |
| DataContractJsonSerializer | 846 | 497 | 84% | 50% |
| Xml | 989 | 634 | 99% | 63% |
| BinaryFormatter | 1002 | 649 | 100% | 65% |

When looking at the larger messages with binary content, we see similar results (the top-of-the-line serializers take roughly the same space). I added two more columns that ignore the outliers, JavascriptSerializer and DataContractJsonSerializer (which serialize byte[] as an array of bytes represented as strings).

| Serializer | Uncompressed bytes | Compressed (optimal) | % of max | % of max compressed | % of max w/o outliers | % of max w/o outliers, compressed |
|---|---:|---:|---:|---:|---:|---:|
| MessagePack | 314822 | 49330 | 29% | 4% | 75% | 12% |
| Avro | 314817 | 49309 | 29% | 4% | 75% | 12% |
| ThriftCompact | 314821 | 49320 | 29% | 4% | 75% | 12% |
| Proto3 | 314822 | 49325 | 29% | 4% | 75% | 12% |
| BondUnsafeSimpleCopied | 314823 | 49322 | 29% | 4% | 75% | 12% |
| BondUnsafeCompactReused | 314825 | 49332 | 29% | 4% | 75% | 12% |
| ThriftBinary | 314833 | 49341 | 29% | 4% | 75% | 12% |
| NewtonsoftJsonReusedSerializer | 419349 | 84895 | 38% | 8% | 100% | 20% |
| JavascriptSerializer | 1101496 | 78217 | 100% | 7% | 263% | 19% |
| DataContractJsonSerializer | 1101719 | 78238 | 100% | 7% | 263% | 19% |
| Xml | 419617 | 85045 | 38% | 8% | 100% | 20% |
| BinaryFormatter | 315231 | 49551 | 29% | 4% | 75% | 12% |

Size-wise, all of the serializers perform roughly the same. The largest difference comes from the fact that byte[] does not have to be represented as base64 (Newtonsoft.Json) or as a string of bytes (JavascriptSerializer, DataContractJsonSerializer). To put it into perspective, uncompressed JSON is 33% bigger than uncompressed anything-else (base64 FTW); in compressed size the difference is even larger (50%+).
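That 33% is just base64 arithmetic: every 3 input bytes become 4 output characters. A quick back-of-the-envelope check (the 300 kB figure is an assumed round number close to the payloads above):

```csharp
// Base64 encodes each 3-byte group as 4 characters, so a ~300 kB payload
// grows to ~400 kB before compression even starts.
int contentLength = 300_000;
int base64Length = 4 * ((contentLength + 2) / 3);
System.Console.WriteLine(base64Length); // 400000
```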

Performance

Since we established that the size differences are minor, let’s look at the performance results. Not all serializers were tested under the same conditions, but more on that a bit later.

On the large objects, the serializers work on the order of 100–400 us, with the built-in C# serializers being the slowest (well outside that range), Newtonsoft.Json being slow as well, and the rest not far from each other. The fastest configuration (Bond) reaches 16 us per serialization, which seems crazy. Note, however, that it didn’t involve allocating a single byte (nor any garbage collection); that’s because I configured it to reuse the buffer. I didn’t do that with the other serializers, but if performance is important to you, you should consider using buffer pools to avoid unnecessary garbage collections. (On the other hand, you still need to create your object, which involves memory operations, so in the big picture the difference might not be that noticeable.)

Having said that, the slowest Bond variant, BondUnsafeSimple with additional buffer copying, is about as fast as Proto3 or Avro. Surprisingly, Thrift is at the back of the peloton, and it looks like it allocates twice the required memory: while Avro, Bond and Proto3 allocate around 600 kB, Thrift and MessagePack allocate 1.2 MB and are almost two times slower. This may well be down to how MemoryStream works: when it needs to expand, it doubles its allocation.

| Method | Mean | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|
| NewtonsoftJsonReusedSerializer | 1,456.6460 us | 4.5955 us | 324.4792 | 296.0938 | 292.7083 | 1.48 MB |
| NewtonsoftJsonGenericSerializer | 1,256.3022 us | 7.9291 us | 321.6146 | 242.9688 | 163.8021 | 1.69 MB |
| NewtonsoftJsonDataContract | 1,260.6177 us | 7.7955 us | 318.75 | 244.2708 | 161.9792 | 1.69 MB |
| Xml | 716.6923 us | 3.6916 us | 276.1719 | 250.651 | 249.8698 | 1.32 MB |
| DataContractJsonSerializer | 45,589.4428 us | 141.1303 us | 45.8333 | 45.8333 | 45.8333 | 4.57 MB |
| BondUnsafeCompact | 265.6632 us | 2.5722 us | 152.7344 | 137.5 | 136.849 | 857.03 kB |
| BondUnsafeSimple | 122.7984 us | 0.6369 us | 72.3307 | 56.7057 | 56.7057 | 382.2 kB |
| BondUnsafeSimpleCopied | 225.2874 us | 3.5498 us | 126.6276 | 111.1328 | 111.0026 | 698.52 kB |
| BondUnsafeCompactReused | 366.4184 us | 5.9772 us | 201.5625 | 185.8724 | 184.9609 | 1.17 MB |
| BondUnsafeCompactReusedCopied | 372.5610 us | 3.7343 us | 209.375 | 193.6849 | 192.7734 | 1.17 MB |
| BondUnsafeSimpleReused | 121.1955 us | 1.3657 us | 71.5169 | 55.9896 | 55.8919 | 382.13 kB |
| BondUnsafeSimpleReusedBuffer | 16.2607 us | 0.0823 us | - | - | - | 0 B |
| JavascriptSerializer | 70,611.3527 us | 263.3033 us | 1687.5 | 1062.5 | 125 | 14.46 MB |
| Proto3 | 213.8052 us | 2.3267 us | 108.0729 | 106.1198 | 106.1198 | 640.15 kB |
| BinaryFormatter | 415.9039 us | 5.1037 us | 197.526 | 195.5729 | 195.5729 | 1.27 MB |
| MessagePack | 381.4887 us | 4.9364 us | 190.3646 | 190.3646 | 190.3646 | 1.26 MB |
| Avro | 219.0960 us | 2.3580 us | 108.2031 | 107.2266 | 107.2266 | 637.4 kB |
| ThriftBinary | 379.5297 us | 3.1255 us | 192.3177 | 190.3646 | 190.3646 | 1.27 MB |
| ThriftCompact | 388.4297 us | 8.7005 us | 193.3594 | 191.5365 | 191.5365 | 1.27 MB |

On the smaller object, the performance was unmeasurable with default Benchmark.Net settings (it was too fast). I might come back to these tests later.

What really matters in terms of performance

Based on the test results, I’d risk saying that allocations and garbage collections have the largest impact on performance. Most performance problems come from memory allocation; if you can avoid additional allocations, you will notice performance improvements of up to 50% in some cases. If your application is serialization-heavy, using buffer pools can significantly improve its performance. Keep in mind that it might not matter in your application; chances are that your logic is far more time-consuming than the serialization itself.
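A minimal sketch of the buffer-pool idea using System.Buffers (SerializeInto and Send are hypothetical placeholders for whatever serializer and sink you actually use):

```csharp
using System.Buffers;

// Rent a reusable buffer instead of allocating a fresh byte[] per message.
byte[] buffer = ArrayPool<byte>.Shared.Rent(64 * 1024);
try
{
    int written = SerializeInto(buffer, record); // hypothetical: serialize 'record' into 'buffer'
    Send(buffer, 0, written);                    // hypothetical: consume only the written bytes
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer); // make the buffer available for the next call
}
```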

I briefly touched on this point in the previous paragraph, but let’s compare BondUnsafeSimple, BondUnsafeSimpleCopied and BondUnsafeSimpleReusedBuffer. The first and second differ in that Bond’s “OutputBuffer” is copied into a separate array (the buffer has more capacity than the serialized size). You probably won’t do that if you are saving the object to disk or sending it over the wire, but you can see that the copy operation basically doubles the memory allocation. Similarly, BondUnsafeSimpleReusedBuffer differs from BondUnsafeSimple in that it doesn’t even recreate the “OutputBuffer” for subsequent serializations. Once the buffer has grown to a certain size (and gets reused), no more reallocations are required. This proves (or at least hints!) that the majority of serialization time is spent allocating memory, not actually dumping the data (especially when we are talking about copying a byte[] into a stream).
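To make the reuse concrete, here is roughly what the reused-buffer variant does with Bond’s C# API (a sketch under my reading of the API; “Record” stands in for a Bond-generated schema type, and the real benchmark code in the repo is authoritative):

```csharp
using Bond;
using Bond.Protocols;
using Bond.IO.Unsafe;

// Created once: a reusable output buffer and a writer bound to it.
var output = new OutputBuffer(64 * 1024);
var writer = new SimpleBinaryWriter<OutputBuffer>(output);

// Per serialization: rewind the buffer instead of reallocating it.
output.Position = 0;
Serialize.To(writer, record); // 'record' is an instance of a Bond-generated 'Record' type
ArraySegment<byte> data = output.Data; // only the valid bytes; no copy into a fresh array
```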

Conclusion

If you need to serialize data that contains binaries (as well as other properties), you have lots of choices; none of them involve a ‘human readable’ representation. Any of Avro, Thrift, MessagePack or Proto3 would do the trick, and Avro, Proto3 and Bond seem to stand out. Proto3 is well established, and so is Bond (it is public knowledge that Bond is used in at-scale infrastructure at Microsoft). I will look later into whether there is something I am doing wrong with Thrift that causes it to have ~70% worse performance (and higher memory usage) than the others.

What about deserialization?

Now, when it comes to deserialization… there will be another article. One interesting property I want to check in subsequent chapters is lazy deserialization: sometimes, when the object is large, you might want to deserialize (load into memory) just a part of it. This may be a bigger deal in languages like JavaScript, where loading something into memory means not only reading the string but also interpreting it, but it is an interesting property nevertheless.
