azure-sdk-for-net: Avro Serializer Very Slow

Issue

When serializing a test object with both the Avro serializer and a JSON serializer, I found a huge difference in performance. The JSON serializer serialized 2,000,000 (2M) objects in about 36 seconds (roughly 18 microseconds per object), while the Avro serializer serialized only 100 objects in 45 seconds (roughly 450 milliseconds per object). That makes it unusable for processing large amounts of data.


Code Example

Note: The serialization methods below fetch a schema defined in Azure and assign it to the Schema property of the test object. However, the test object also provides the same schema as a hard-coded string, so the code fetching the schema can be removed. It does not make any appreciable difference in performance.

```csharp
internal void TestSchema()
{
    var avroCount = 100;
    var jsonCount = 2000000;
    var bytes = AvroSerialize(avroCount);
    bytes = JsonSerialize(jsonCount);
}

private byte[] AvroSerialize(int max)
{
    var schemaClient = GetSchemaRegistryClient();
    var s = schemaClient.GetSchema("MY_SCHEMA_GUID_HERE");
    var schema = Avro.Schema.Parse(s.Value.Content);
    byte[] bytes;
    var start = DateTime.Now;
    using (var memoryStream = new MemoryStream())
    {
        // One serializer instance is reused for every object in the loop.
        var serializer = new SchemaRegistryAvroObjectSerializer(schemaClient, "TestDataSchema", new SchemaRegistryAvroObjectSerializerOptions { AutoRegisterSchemas = true });
        for (var id = 1; id <= max; id++)
        {
            var person = new Person() { PersonId = id, PersonName = $"SomeName{id}", PersonDate = DateTime.Now, Schema = schema };
            serializer.Serialize(memoryStream, person, typeof(Person), CancellationToken.None);
        }
        bytes = memoryStream.ToArray();
    }
    var finish = DateTime.Now;
    var elapsed = finish - start;
    Console.WriteLine($"Avro serialized {max} Person objects in {elapsed}");
    return bytes;
}

private byte[] JsonSerialize(int max)
{
    var schemaClient = GetSchemaRegistryClient();
    var s = schemaClient.GetSchema("MY_SCHEMA_GUID_HERE");
    var schema = Avro.Schema.Parse(s.Value.Content);
    byte[] bytes;
    var start = DateTime.Now;
    using (var memoryStream = new MemoryStream())
    {
        using (var sw = new StreamWriter(memoryStream, new UTF8Encoding(false), 1024, leaveOpen: true))
        {
            for (var id = 1; id <= max; id++)
            {
                var person = new Person() { PersonId = id, PersonName = $"SomeName{id}", PersonDate = DateTime.Now, Schema = schema };
                var serialized = JsonConvert.SerializeObject(person);
                sw.Write(serialized);
                sw.Flush();
            }
        }
        bytes = memoryStream.ToArray();
        var finish = DateTime.Now;
        var elapsed = finish - start;
        Console.WriteLine($"JSON serialized {max} Person objects in {elapsed}");
    }
    return bytes;
}

public class Person : ISpecificRecord
{
    // Hard-coded copy of the schema that is also registered in Azure.
    public static Schema _SCHEMA = Avro.Schema.Parse(
        "{\"type\":\"record\",\"name\":\"Person\",\"namespace\":\"DataWriterDriver\",\"fields\":[" +
        "{\"name\":\"PersonId\",\"type\":\"int\"}," +
        "{\"name\":\"PersonName\",\"type\":\"string\"}," +
        "{\"name\":\"PersonDate\",\"type\":\"long\",\"logicalType\":\"local-timestamp-millis\"}]}");

    public int PersonId { get; set; }

    public string PersonName { get; set; }

    public DateTime PersonDate { get; set; }

    public Schema Schema { get; set; }

    public Person()
    {
        Schema = _SCHEMA;
    }

    // Field positions follow the order of "fields" in the schema.
    public object Get(int fieldPos)
    {
        switch (fieldPos)
        {
            case 0: return PersonId;
            case 1: return PersonName;
            case 2:
                // PersonDate is written as milliseconds since the Unix epoch.
                var offset = new DateTimeOffset(PersonDate);
                return offset.ToUnixTimeMilliseconds();
            default: throw new AvroRuntimeException("Bad index " + fieldPos + " in Get()");
        }
    }

    public void Put(int fieldPos, object fieldValue)
    {
        switch (fieldPos)
        {
            case 0: PersonId = (int)fieldValue; break;
            case 1: PersonName = (string)fieldValue; break;
            case 2:
                if (fieldValue is long unixTicks)
                {
                    PersonDate = DateTimeOffset.FromUnixTimeMilliseconds(unixTicks).DateTime;
                }
                break;
            default: throw new AvroRuntimeException("Bad index " + fieldPos + " in Put()");
        }
    }

    public override string ToString()
    {
        return $"PersonID:{PersonId} PersonName:{PersonName} PersonDate:{PersonDate}";
    }
}
```

Environment: Azure.Data.SchemaRegistry.ApacheAvro --version 1.0.0-beta.2, Windows 10, .NET Core 3.1, Visual Studio 16.7.5

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

Hi @jkmbowachas, can you let us know what version you are using? Can you provide a snippet of the code you are using to test?

@LarryF813, the version that you were using did not cache schema IDs, so each serialize call had to talk to the service to look up the schema ID. The current beta has caching, along with some fundamental changes to the API.
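
To illustrate why caching schema IDs matters here, the following is a minimal sketch of the general idea, not the SDK's actual implementation. `SchemaIdLookup` and `CachingSchemaIdResolver` are hypothetical names invented for this example, and the lambda in `Main` merely stands in for a network call to the schema registry.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for a remote schema registry lookup (not an Azure SDK type).
public delegate string SchemaIdLookup(string schemaDefinition);

// Hypothetical helper: remembers the ID returned for each schema definition.
public sealed class CachingSchemaIdResolver
{
    private readonly ConcurrentDictionary<string, string> _cache =
        new ConcurrentDictionary<string, string>();
    private readonly SchemaIdLookup _remoteLookup;

    public CachingSchemaIdResolver(SchemaIdLookup remoteLookup)
    {
        _remoteLookup = remoteLookup ?? throw new ArgumentNullException(nameof(remoteLookup));
    }

    public string GetSchemaId(string schemaDefinition)
    {
        // The remote lookup runs only on a cache miss. Under concurrent access
        // GetOrAdd may invoke it more than once for the same key, but only one
        // result is stored.
        return _cache.GetOrAdd(schemaDefinition, def => _remoteLookup(def));
    }
}

public static class Demo
{
    public static void Main()
    {
        var remoteCalls = 0;
        var resolver = new CachingSchemaIdResolver(def =>
        {
            remoteCalls++; // counts simulated round trips to the service
            return Guid.NewGuid().ToString("N");
        });

        // Resolve the same schema 100 times, as the repro loop would.
        for (var i = 0; i < 100; i++)
        {
            resolver.GetSchemaId("{\"type\":\"record\",\"name\":\"Person\"}");
        }

        Console.WriteLine($"Lookups: 100, remote calls: {remoteCalls}");
        // prints "Lookups: 100, remote calls: 1"
    }
}
```

Without a cache, every one of the 100 serialize calls in the repro pays a service round trip, which would be consistent with the roughly 450 ms per object reported above.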