fastavro
The current Python avro package is dog slow.
On a test case of about 10K records, it takes about 14sec to iterate over all of them. In comparison, the Java Avro SDK does it in about 1.9sec.
fastavro is an alternative implementation that is much faster. It iterates over the same 10K records in 2.9sec, and if you use it with PyPy it'll do it in 1.5sec (to be fair, the Java benchmark is doing some extra JSON encoding/decoding).
If the optional C extension (generated by Cython) is available, then fastavro will be even faster. For the same 10K records it’ll run in about 1.7sec.
Supported Features
File Writer
File Reader (iterating via records or blocks)
Schemaless Writer
Schemaless Reader (a sketch of both follows this list)
JSON Writer
JSON Reader (also sketched after this list)
Codecs (Snappy, Deflate, Zstandard, Bzip2, LZ4, XZ; a codec sketch follows the example below)
Schema resolution
Aliases (schema resolution and aliases are demonstrated after the example below)
Logical Types
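The schemaless writer and reader work on a single record with no container header, which fits cases where the schema travels out of band (for example alongside messages on a queue). A minimal sketch under that assumption; the abbreviated Weather schema is illustrative, and because the encoded bytes carry no schema, the reader must be handed the writer's schema explicitly.

import io

from fastavro import parse_schema, schemaless_writer, schemaless_reader

schema = parse_schema({
    'name': 'Weather',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'station', 'type': 'string'},
        {'name': 'temp', 'type': 'int'},
    ],
})

# Encode one record as raw Avro bytes: no file header, no embedded schema
buf = io.BytesIO()
schemaless_writer(buf, schema, {'station': '011990-99999', 'temp': 22})

# Decode it again; the writer's schema must be supplied since none is stored
buf.seek(0)
print(schemaless_reader(buf, schema))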
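The JSON writer and reader use the same schemas but produce and consume the Avro JSON encoding over a text stream instead of binary. Another minimal sketch along the same lines:

import io

from fastavro import json_writer, json_reader, parse_schema

schema = parse_schema({
    'name': 'Weather',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'station', 'type': 'string'},
        {'name': 'temp', 'type': 'int'},
    ],
})

# json_writer expects a text-mode stream (here an in-memory StringIO)
buf = io.StringIO()
json_writer(buf, schema, [{'station': '011990-99999', 'temp': 22}])

# json_reader needs the schema as well, since the JSON text does not embed it
buf.seek(0)
for record in json_reader(buf, schema):
    print(record)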
Missing Features
Anything involving Avro’s RPC features
Parsing schemas into the canonical form
Schema fingerprinting
Example
from fastavro import writer, reader, parse_schema

schema = {
    'doc': 'A weather reading.',
    'name': 'Weather',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'station', 'type': 'string'},
        {'name': 'time', 'type': 'long'},
        {'name': 'temp', 'type': 'int'},
    ],
}
parsed_schema = parse_schema(schema)

# 'records' can be any iterable, including a generator
records = [
    {'station': '011990-99999', 'temp': 0, 'time': 1433269388},
    {'station': '011990-99999', 'temp': 22, 'time': 1433270389},
    {'station': '011990-99999', 'temp': -11, 'time': 1433273379},
    {'station': '012650-99999', 'temp': 111, 'time': 1433275478},
]

# Writing
with open('weather.avro', 'wb') as out:
    writer(out, parsed_schema, records)

# Reading
with open('weather.avro', 'rb') as fo:
    for record in reader(fo):
        print(record)
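Compression is selected per file through the writer's codec argument (the default is the null codec). A minimal sketch reusing parsed_schema and records from the example above; 'deflate' is handled by the standard library, while codecs such as 'snappy' or 'zstandard' require their optional third-party packages:

# Same data as above, but with each data block deflate-compressed
with open('weather_deflate.avro', 'wb') as out:
    writer(out, parsed_schema, records, codec='deflate')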
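Schema resolution and aliases come into play when a file is read with a schema that differs from the one it was written with: passing a reader schema as the second argument to reader resolves the two by Avro's rules. A sketch against weather.avro from the example; the renamed field and the defaulted new field are illustrative assumptions, not part of the original schema:

from fastavro import parse_schema, reader

# A newer version of the schema: 'temp' is now 'temperature' (matched to the
# old name via an alias) and 'humidity' is new, filled from its default
reader_schema = parse_schema({
    'doc': 'A weather reading.',
    'name': 'Weather',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'station', 'type': 'string'},
        {'name': 'time', 'type': 'long'},
        {'name': 'temperature', 'type': 'int', 'aliases': ['temp']},
        {'name': 'humidity', 'type': 'int', 'default': -1},
    ],
})

with open('weather.avro', 'rb') as fo:
    for record in reader(fo, reader_schema):
        print(record)  # each record now has 'temperature' and 'humidity'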