I’ve recently been looking into different serialization options. While there are plenty of writeups already available (even for C#), I wanted to:

- Have one about C#
- Learn something new 😉
- Look into my particular data distribution / characteristics
- Understand not only performance, but also size impact

I’d say that there are very few summaries that you should simply rely on. With some exceptions, serialization is generally pretty fast, and the choice you make also depends on the following considerations:

- Do you need it to be ‘human readable’?
  - Is it enough to have a tool that presents the serialized data for you?
- Do you need tagged or untagged serialization? Untagged serialization can be noticeably faster and smaller.
  - What if even the field names aren’t preserved in the serialized form? If you don’t have a schema agreed upon ahead of time, some consumers of your data (e.g. data visualization tools) might not know how to interpret it.
- Do you care about maximum compactness of the data representation?
- How much do you really care about performance? If you are going to send the data ‘over the wire’, chances are that any of the serializers will be ‘fast enough’.
- Do you need cross-language support? Do you need code-gen for those languages?
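
To make the tagged vs. untagged trade-off concrete, here is a minimal sketch using only BinaryWriter from the base class library. It is not how any of the serializers below work internally; the record, its field names and its values are made up for illustration. The tagged variant writes each field name next to its value, the untagged variant writes only the values in an agreed order.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical record with four string fields (names and values are made up).
var record = new (string Name, string Value)[]
{
    ("Url", "http://example.com/index.html"),
    ("Title", "Example page"),
    ("ContentType", "text/html"),
    ("Server", "nginx"),
};

// Tagged: each value is preceded by its field name, so a reader needs no schema.
static byte[] WriteTagged((string Name, string Value)[] items)
{
    using var ms = new MemoryStream();
    using (var writer = new BinaryWriter(ms, Encoding.UTF8, leaveOpen: true))
    {
        foreach (var (name, value) in items)
        {
            writer.Write(name);   // length-prefixed UTF-8 string
            writer.Write(value);
        }
    }
    return ms.ToArray();
}

// Untagged: only the values, in an order both sides agreed on ahead of time.
static byte[] WriteUntagged((string Name, string Value)[] items)
{
    using var ms = new MemoryStream();
    using (var writer = new BinaryWriter(ms, Encoding.UTF8, leaveOpen: true))
    {
        foreach (var (_, value) in items)
        {
            writer.Write(value);
        }
    }
    return ms.ToArray();
}

// Reading untagged data back only works if the reader knows the field count and order.
static string[] ReadUntagged(byte[] data, int fieldCount)
{
    using var reader = new BinaryReader(new MemoryStream(data), Encoding.UTF8);
    var values = new string[fieldCount];
    for (int i = 0; i < fieldCount; i++)
    {
        values[i] = reader.ReadString();
    }
    return values;
}

byte[] tagged = WriteTagged(record);
byte[] untagged = WriteUntagged(record);
string[] roundTripped = ReadUntagged(untagged, record.Length);

Console.WriteLine($"tagged: {tagged.Length} bytes, untagged: {untagged.Length} bytes");
Console.WriteLine($"first value read back: {roundTripped[0]}");
```

The untagged output is smaller by exactly the bytes spent on field names, but nothing in it tells a third-party tool what the four strings mean. That is the schema problem mentioned in the list above.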

In general, picking a serializer comes down to picking your personal favorite. TL;DR: don’t use XML. If you have to, use JSON. It’s easy to deal with when schemaless (although dates can pose some challenges). In C#, if you can, avoid JavaScriptSerializer, DataContractSerializer and DataContractJsonSerializer. They are on the slow side.

Goals

I want to determine the time and size impact of different serializers. I am more concerned about size, assuming that time will be about the same for the majority of serializers. The goal of this post is not to document the differences between serialization stacks (analysis of languages, APIs, etc.).

Serializers in the set

Almost all of the serializers in this list support some kind of RPC on their own; I’ll skip that part in this analysis.

- Newtonsoft.Json – a good JSON serializer. It supports the ‘DataContract’ attributes from System.Runtime.Serialization, and it also supports BSON. (todo: BSON)
- Thrift – a fully fledged, cross-language/cross-platform ‘RPC’ library (service development), used (among others) by Twitter and Salesforce (also combined with Finagle). It has a language-agnostic data (and service) definition layer, which then transpiles to the specific language of your choice. Various serialization options are available.
- Avro – a data serialization system that also provides an RPC layer if needed. It relies on schemas (but the schema is embedded with the message/data), and code does not have to be generated (unlike Thrift and Protobuf).
- Bond – Microsoft’s (cross-platform) serialization mechanism.
- ProtoBuf(fers) (on GitHub) – very similar to Thrift (see Protobuf vs Thrift on Stack Overflow); Thrift is more RPC oriented.
- MessagePack – can work without a schema.
- DataContractJsonSerializer, JavaScriptSerializer – built-in .NET JSON serializers.
- DataContractSerializer (XML) – built-in .NET serializer.
- BinaryFormatter – built-in binary formatter / serializer for .NET. You’d probably never use it in production, as it doesn’t really provide any backward compatibility in case of schema changes.

About the project

I used BenchmarkDotNet for the performance experiments. While it puts some constraints on the code layout, it not only measures performance but also takes care of setting up the measurement correctly, and it reports approximate allocations and garbage collections.

If a serializer cannot work with my standard data type, I use AutoMapper to map from my original type to the type that serializer expects. Since some serializers don’t handle nulls, at some point I decided not to have nulls in my properties.

You can find sources on github.

The test data

I decided to test two scenarios. One is an object that contains 4 strings. The other one contains binary data (in my case the ‘binary’ is HTML). It is meant to represent content fetched by a web fetcher, so the data contains Url (or string), Response Header (as text, since it’s ANSI), and Content (byte[]), since the fetcher itself might not know what encoding to apply.

I generated two objects which I am using across all tests. Both objects are generated before the tests are performed. All serializer instances are also created before the test begins; I assume the one-time creation cost is negligible (even if it’s one-time per type).

What I didn’t test is how these serializers handle nested objects, cycles, etc. While all of them work fine with nested objects, they differ in cycle handling, and some of them are configurable in that regard. Note that cycle (and reference) handling almost always has an additional performance impact, hence it was out of scope.

Size & compressibility

One of the goals is to save space. It turns out that you cannot do much better than untagged serialization that supports binary arrays as a first-class citizen. Of course there is an open question of how inheritance is implemented (if supported), but that is outside the scope of this document.

Let’s take a look at the basic object (4 strings). The first column shows the uncompressed size; the second shows the size after compressing the data with DeflateStream. The third column shows the uncompressed size as a % of the largest uncompressed size, and the fourth column shows the compressed size as a % of that same largest uncompressed size.

As we can see, all of the ‘top of the line’ serializers produce very similar sized results (well, to be frank, with 4 strings there is not much rocket science in that).

Serializer	uncompressed bytes	compressed (optimal)	% of max	% of max compressed
MessagePack	710	457	71%	46%
Avro	708	457	71%	46%
ThriftCompact	713	460	71%	46%
Proto3	712	463	71%	46%
BondUnsafeSimpleCopied	719	464	72%	46%
BondUnsafeCompactReused	717	469	72%	47%
ThriftBinary	732	478	73%	48%
NewtonsoftJsonReusedSerializer	774	488	77%	49%
JavascriptSerializer	774	488	77%	49%
DataContractJsonSerializer	846	497	84%	50%
Xml	989	634	99%	63%
BinaryFormatter	1002	649	100%	65%
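
For reference, the two derived measurements in the table (compressed size and % of max) can be sketched as follows. This is a minimal sketch, not the benchmark project’s actual code: the helper names Deflate and PercentOf and the sample payload are my own, and I am assuming that “optimal” in the column header corresponds to CompressionLevel.Optimal.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Compresses a serialized payload with DeflateStream (assumed: CompressionLevel.Optimal).
static byte[] Deflate(byte[] payload)
{
    using var output = new MemoryStream();
    using (var deflate = new DeflateStream(output, CompressionLevel.Optimal, leaveOpen: true))
    {
        deflate.Write(payload, 0, payload.Length);
    }   // disposing the DeflateStream flushes the final compressed block
    return output.ToArray();
}

// Size as a percentage of the largest uncompressed size, rounded to whole percent.
static string PercentOf(long size, long max) => $"{100.0 * size / max:F0}%";

// Made-up, repetitive payload standing in for a serialized object.
var text = new StringBuilder();
for (int i = 0; i < 50; i++)
{
    text.Append("<p>some repetitive markup</p>");
}
byte[] serialized = Encoding.UTF8.GetBytes(text.ToString());
byte[] compressed = Deflate(serialized);

Console.WriteLine($"{serialized.Length} bytes uncompressed, {compressed.Length} bytes compressed");

// Reproducing the MessagePack row from the table (largest uncompressed size is 1002 bytes):
Console.WriteLine(PercentOf(710, 1002));   // 71%
Console.WriteLine(PercentOf(457, 1002));   // 46%
```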

When looking at larger messages with binary content we see similar results (the top-of-the-line serializers produce roughly the same size). I added two more columns that ignore the two outliers, JavascriptSerializer and DataContractJsonSerializer (those serializers write byte[] as a textual array of individual byte values).

Serializer	uncompressed bytes	compressed (optimal)	% of max	% of max compressed	% of max without outliers	% of max without outliers compressed
MessagePack	314822	49330	29%	4%	75%	12%
Avro	314817	49309	29%	4%	75%	12%
ThriftCompact	314821	49320	29%	4%	75%	12%
Proto3	314822	49325	29%	4%	75%	12%
BondUnsafeSimpleCopied	314823	49322	29%	4%	75%	12%
BondUnsafeCompactReused	314825	49332	29%	4%	75%	12%
ThriftBinary	314833	49341	29%	4%	75%	12%
NewtonsoftJsonReusedSerializer	419349	84895	38%	8%	100%	20%
JavascriptSerializer	1101496	78217	100%	7%	263%	19%
DataContractJsonSerializer	1101719	78238	100%	7%	263%	19%
Xml	419617	85045	38%	8%	100%	20%
BinaryFormatter	315231	49551	29%	4%	75%	12%

Size-wise, all of the binary serializers perform roughly the same. The largest difference comes from how byte[] is represented: the binary formats store it as raw bytes, whereas Newtonsoft.Json encodes it as base64 and the JavaScript serializer and data contract JSON serializer write it out as a textual array of byte values. Just to put it into perspective, uncompressed Newtonsoft JSON is 33% bigger than any of the uncompressed binary formats (base64 FTW: 4 output characters for every 3 input bytes), and in compressed size the difference is even larger (50%+).
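
The size ratios above follow directly from how the bytes are written as text. A small sketch with made-up random content (not the HTML used in the tests) shows both effects: base64 emits 4 characters per 3 bytes, and a textual array of byte values spends one to three digits plus a comma on every byte.

```csharp
using System;
using System.Linq;

// Made-up binary content; the real test data was an HTML page.
var content = new byte[300_000];
new Random(42).NextBytes(content);

// What a base64-based encoding of byte[] produces.
string base64 = Convert.ToBase64String(content);

// What a "JSON array of numbers" encoding of byte[] produces, e.g. [60,104,116,...].
string numberArray = "[" + string.Join(",", content.Select(b => b.ToString())) + "]";

Console.WriteLine($"raw bytes:    {content.Length}");
Console.WriteLine($"base64:       {base64.Length} chars ({(double)base64.Length / content.Length:F2}x)");
Console.WriteLine($"number array: {numberArray.Length} chars ({(double)numberArray.Length / content.Length:F2}x)");
```

The exact factor for the number array depends on the byte values, so it will not match the table precisely. The roughly 3.5x seen there for the two JavaScript-style serializers is in the range this encoding produces.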

Performance

Since we established that the size differences are minor, let’s look at the performance results. Not all serializers were tested under the same conditions, but more on that a bit later.

On the large objects, most serializers take on the order of 100-400 us per serialization. The built-in JSON serializers are the slowest by a wide margin (45-70 ms), Newtonsoft.Json and the built-in XML serializer also fall outside that range (0.7-1.5 ms), and the rest are not that far from each other.

The fastest configuration (Bond) reaches 16 us per serialization, which seems crazy. Note, however, that it didn’t allocate a single byte (and triggered no garbage collections), because I configured it to reuse the buffer. I didn’t do that with the other serializers, but if performance is important to you, you should consider using buffer pools to avoid unnecessary garbage collections. (On the other hand, you still need to create your object, which involves memory operations, so the difference in the big picture might not be that noticeable.)

Having said that, the slowest Bond Simple variant, the one with additional buffer copying (BondUnsafeSimpleCopied), is about as fast as Proto3 or Avro. Surprisingly, Thrift is at the end of the peloton, and it looks like it allocates twice the required memory: while Avro, Bond and Proto3 allocate around 600 kB, Thrift and MessagePack allocate about 1.2 MB and are almost 2 times slower. That may well be because of how MemoryStream works: when it needs to expand, it doubles its capacity.

Method	Mean	StdDev	Gen 0	Gen 1	Gen 2	Allocated
NewtonsoftJsonReusedSerializer	1,456.6460 us	4.5955 us	324.4792	296.0938	292.7083	1.48 MB
NewtonsoftJsonGenericSerializer	1,256.3022 us	7.9291 us	321.6146	242.9688	163.8021	1.69 MB
NewtonsoftJsonDataContract	1,260.6177 us	7.7955 us	318.75	244.2708	161.9792	1.69 MB
Xml	716.6923 us	3.6916 us	276.1719	250.651	249.8698	1.32 MB
DataContractJsonSerializer	45,589.4428 us	141.1303 us	45.8333	45.8333	45.8333	4.57 MB
BondUnsafeCompact	265.6632 us	2.5722 us	152.7344	137.5	136.849	857.03 kB
BondUnsafeSimple	122.7984 us	0.6369 us	72.3307	56.7057	56.7057	382.2 kB
BondUnsafeSimpleCopied	225.2874 us	3.5498 us	126.6276	111.1328	111.0026	698.52 kB
BondUnsafeCompactReused	366.4184 us	5.9772 us	201.5625	185.8724	184.9609	1.17 MB
BondUnsafeCompactReusedCopied	372.5610 us	3.7343 us	209.375	193.6849	192.7734	1.17 MB
BondUnsafeSimpleReused	121.1955 us	1.3657 us	71.5169	55.9896	55.8919	382.13 kB
BondUnsafeSimpleReusedBuffer	16.2607 us	0.0823 us	–	–	–	0 B
JavascriptSerializer	70,611.3527 us	263.3033 us	1687.5	1062.5	125	14.46 MB
Proto3	213.8052 us	2.3267 us	108.0729	106.1198	106.1198	640.15 kB
BinaryFormatter	415.9039 us	5.1037 us	197.526	195.5729	195.5729	1.27 MB
MessagePack	381.4887 us	4.9364 us	190.3646	190.3646	190.3646	1.26 MB
Avro	219.0960 us	2.3580 us	108.2031	107.2266	107.2266	637.4 kB
ThriftBinary	379.5297 us	3.1255 us	192.3177	190.3646	190.3646	1.27 MB
ThriftCompact	388.4297 us	8.7005 us	193.3594	191.5365	191.5365	1.27 MB
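
The MemoryStream guess above is easy to check in isolation. The sketch below is not the Thrift or MessagePack code path, just a bare MemoryStream fed in 1000-byte chunks (a chunk size I picked arbitrarily) until it holds about as much data as the large test object. It records every capacity the stream goes through.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

var chunk = new byte[1000];            // arbitrary chunk size for illustration
var capacities = new List<int>();      // every distinct capacity the stream had
var stream = new MemoryStream();       // starts with no buffer at all

for (int i = 0; i < 300; i++)
{
    stream.Write(chunk, 0, chunk.Length);
    if (capacities.Count == 0 || capacities[capacities.Count - 1] != stream.Capacity)
    {
        capacities.Add(stream.Capacity);
    }
}

Console.WriteLine("capacities: " + string.Join(", ", capacities));
Console.WriteLine($"length: {stream.Length}, capacity: {stream.Capacity}");

// ToArray() allocates yet another array of exactly Length bytes.
byte[] copy = stream.ToArray();
Console.WriteLine($"copy: {copy.Length} bytes");
```

Every growth step allocates a new, larger internal array and copies the old content over, so a stream that starts small allocates noticeably more than the final payload size in total. Passing an initial capacity to the MemoryStream constructor, when the approximate size is known, avoids the intermediate buffers.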

On the smaller object, the performance was unmeasurable with the default BenchmarkDotNet settings (it was too fast). I might come back to these tests later.

What really matters in terms of performance

Based on the test results, I’d venture to say that allocations and garbage collections have the largest impact on perf: most performance problems come from memory allocation. If you can avoid additional allocations, you will (in some cases) notice a 50% performance improvement. If your application is serialization heavy, using buffer pools can significantly improve performance. Keep in mind that it might not matter in your application; chances are that your logic is far more time consuming than the serialization itself.
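
As a rough sketch of what a buffer pool looks like in practice, here is ArrayPool<byte> from System.Buffers (built into modern .NET; on .NET Framework it comes from the System.Buffers NuGet package). I did not use it in the benchmarks. The WriteRecord helper is a made-up stand-in for a real serializer and does not produce any of the formats discussed here.

```csharp
using System;
using System.Buffers;
using System.Text;

string url = "http://example.com/index.html";   // made-up sample data
byte[] content = new byte[300_000];

// Stand-in for a real serializer: writes the URL bytes followed by the content
// into a caller-supplied buffer and returns the number of bytes written.
static int WriteRecord(byte[] destination, string recordUrl, byte[] recordContent)
{
    int written = Encoding.UTF8.GetBytes(recordUrl, 0, recordUrl.Length, destination, 0);
    Buffer.BlockCopy(recordContent, 0, destination, written, recordContent.Length);
    return written + recordContent.Length;
}

int needed = Encoding.UTF8.GetMaxByteCount(url.Length) + content.Length;

// Rent returns an array of at least the requested size, often larger.
byte[] rented = ArrayPool<byte>.Shared.Rent(needed);
int rentedLength = rented.Length;
int written = 0;
try
{
    written = WriteRecord(rented, url, content);
    // ... send the first `written` bytes over the wire or write them to disk here ...
}
finally
{
    // Hand the buffer back so the next serialization can reuse it instead of allocating.
    ArrayPool<byte>.Shared.Return(rented);
}

Console.WriteLine($"needed up to {needed} bytes, rented {rentedLength}, wrote {written}");
```

The rented array must not be touched after it has been returned, which is why the consuming step sits inside the try block.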

I briefly touched on this in the previous section, but let’s compare BondUnsafeSimple, BondUnsafeSimpleCopied and BondUnsafeSimpleReusedBuffer. The first two differ in that in BondUnsafeSimpleCopied the serialized data is copied from Bond’s “OutputBuffer” into a new array (the buffer has more capacity than the serialized size). You probably won’t do that if you are saving the object to disk or sending it over the wire, but you can see that the copy operation basically doubles the memory allocation (382 kB vs. 699 kB). Similarly, BondUnsafeSimpleReusedBuffer differs from BondUnsafeSimple in that it doesn’t even recreate the “OutputBuffer” for subsequent serializations. Once the buffer has grown to a certain size (and gets reused), no more reallocations are required, and the time drops from about 123 us to 16 us. This proves (or at least hints!) that the majority of serialization time is spent allocating memory rather than actually dumping the data (especially when we are talking about copying a byte[] into a stream).
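
The same three strategies can be mimicked with a plain MemoryStream. This is only an analogy under my own naming, not Bond’s OutputBuffer API: the “serializer” here just dumps a byte[] into the stream. The first helper creates a fresh stream and copies the result out (like the “Copied” variant); the second resets and reuses one long-lived stream and hands out a view over its internal buffer (like the “ReusedBuffer” variant).

```csharp
using System;
using System.IO;

var content = new byte[300_000];   // stand-in for the data of one serialized object

// Fresh stream per call, then a copy into an exact-size array.
static byte[] SerializeFreshAndCopy(byte[] data)
{
    using var stream = new MemoryStream();
    stream.Write(data, 0, data.Length);   // stand-in for a real serializer
    return stream.ToArray();              // second allocation: copies Length bytes
}

// One long-lived stream, reset before each call; no copy of the result.
static ArraySegment<byte> SerializeReused(MemoryStream stream, byte[] data)
{
    stream.SetLength(0);                  // drops the old content but keeps the capacity
    stream.Write(data, 0, data.Length);   // stand-in for a real serializer
    stream.TryGetBuffer(out ArraySegment<byte> view);   // view over the internal buffer
    return view;
}

byte[] copied = SerializeFreshAndCopy(content);

var reused = new MemoryStream();
ArraySegment<byte> first = SerializeReused(reused, content);
int capacityAfterFirstCall = reused.Capacity;

for (int i = 0; i < 1000; i++)
{
    SerializeReused(reused, content);
}

Console.WriteLine($"copied result: {copied.Length} bytes");
Console.WriteLine($"reused view:   {first.Count} bytes");
Console.WriteLine($"capacity unchanged after 1000 more calls: {reused.Capacity == capacityAfterFirstCall}");
```

The catch with the reused variant is that the returned view is only valid until the next serialization overwrites the buffer, so it suits the “write it straight to disk or to the wire” case and not code that holds on to the result.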

Conclusion

If you need to serialize data that contains binaries (as well as other properties), you have lots of choices. None of them involves a ‘human readable’ representation. Any of Avro, Thrift, MessagePack or Proto3 would do the trick. It seems that Avro, Proto3 and Bond might be standing out. Proto3 is well established and so is Bond (it is public knowledge that Bond is used in large-scale infrastructure at Microsoft). I will be looking later on to see if there is something I am doing wrong with Thrift that would cause it to take roughly 70% longer (and use more memory) than the others.

What about deserialization?

Now when it comes to deserialization… there will be another article. One interesting property that I want to check in subsequent chapters is lazy deserialization. Sometimes, when the object is large, you might want to deserialize (load into memory) just a part of it. This may be a bigger deal in languages like JavaScript, where loading something into memory means not only reading the string but also interpreting it; nevertheless, it might be an interesting property.
