Archive for May, 2010

Apache Avro in practice

Monday, May 3rd, 2010


Apache Avro is an important entry in the expanding set of serialization systems (Thrift, Protobuf, Etch, ...). What makes it appealing at first sight is its all-JSON focus: JSON is the format of choice for schema definition and an optional format for data serialization (in addition to the binary format). Those interested in the benefits of such a format (human-readable, line-serializable, standard, easy to integrate) might be sold on this aspect alone.

However, getting up to speed with Avro for simple local serialization might not be as straightforward, mostly due to the lack of examples. We give an example of using Avro with Java for simple local serialization and discuss some potential pitfalls. As a trivial example, we serialize to disk the social graph dataset mentioned in the previous post.

To start building Java projects with Avro support, you need to either obtain the following jars from the official Avro release page: avro-1.3.1.jar, jackson-mapper-asl.jar and jackson-core-asl.jar, or (if you’re using Maven) add the following artifact to your project:

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>avro</artifactId>
            <version>1.3.1</version>
            <scope>compile</scope>
        </dependency>

Once Avro support is in place, we can start by describing the given data format using a simple Avro schema:

{
      "namespace": "test.avro",
      "name": "FacebookUser",
      "type": "record",
      "fields": [
          {"name": "name", "type": "string"},
          {"name": "num_likes", "type": "int"},
          {"name": "num_photos", "type": "int"},
          {"name": "num_groups", "type": "int"} ]
}

This schema is sufficient for simple serialization to a file on disk (no RPC details are needed).

A convenient feature of Avro is that it enables direct serialization from a schema without code generation. We can easily perform JSON serialization of data defined by the schema above using the following code snippet:

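    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.Encoder;
    import org.apache.avro.io.JsonEncoder;

    // Schema embedded as a string literal
    String schemaDescription =
            " {    \n" +
                    " \"name\": \"FacebookUser\", \n" +
                    " \"type\": \"record\",\n" +
                    " \"fields\": [\n" +
                    "   {\"name\": \"name\", \"type\": \"string\"},\n" +
                    "   {\"name\": \"num_likes\", \"type\": \"int\"},\n" +
                    "   {\"name\": \"num_photos\", \"type\": \"int\"},\n" +
                    "   {\"name\": \"num_groups\", \"type\": \"int\"} ]\n" +
                    "}";

    Schema s = Schema.parse(schemaDescription);

    // The encoder is created over an in-memory buffer, then redirected to a file via init()
    ByteArrayOutputStream bao = new ByteArrayOutputStream();
    GenericDatumWriter<GenericRecord> w = new GenericDatumWriter<GenericRecord>(s);
    Encoder e = new JsonEncoder(s, bao);
    e.init(new FileOutputStream(new File("test_data.avro")));

    // Populate a generic record; field names must match the schema
    GenericRecord r = new GenericData.Record(s);
    r.put("name", new org.apache.avro.util.Utf8("Doctor Who"));
    r.put("num_likes", 1);
    r.put("num_photos", 0);
    r.put("num_groups", 423);

    // Serialize the record and flush so that the data reaches the file
    w.write(r, e);
    e.flush();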

Of course, embedding the schema directly in the code does not look particularly attractive, so the preferred approach is to write the schema to a separate config file and use:

Schema s = Schema.parse(new File("schema_path/fb_user.avpr"));

Additionally, if we want binary instead of JSON serialization, we simply change the Encoder implementation. For the binary encoder, that is:

Encoder e = new BinaryEncoder(bao);

In practice, JSON serialization can be used for debugging, when data volume is low, or when we simply want to (ab)use Avro as a general JSON serialization layer. For large-volume data processing and archival, however, the binary format is the preferred option because JSON serialization adds a size overhead. This overhead varies with the actual data values being serialized. The following graph illustrates this for the trivial data format given in this example, for various lengths of string and integer elements, using JSON and binary encoding (uncompressed):

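[Figure avro_serialization1: serialized size with JSON and binary encoding (uncompressed) for various string and integer lengths]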
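To see why the binary size depends on the values themselves, it helps to compute it by hand. The following is a minimal, library-free sketch that assumes the encoding rules of the Avro specification: ints are written as zig-zag variable-length integers, strings as a length prefix followed by UTF-8 bytes, and record fields back to back with no names or separators. The class and method names are made up for illustration and are not part of Avro.

```java
import java.nio.charset.StandardCharsets;

/** Library-free sketch: byte counts for Avro-style binary encoding of ints and strings. */
final class AvroBinarySizeSketch {

    /** Zig-zag maps values of small magnitude (positive or negative) to small unsigned values. */
    static long zigZag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    /** Bytes needed by the variable-length encoding: 7 payload bits per byte. */
    static int varintLength(long zigZagged) {
        int bytes = 1;
        while ((zigZagged & ~0x7FL) != 0) {
            zigZagged >>>= 7;
            bytes++;
        }
        return bytes;
    }

    static int intSize(int value) {
        return varintLength(zigZag(value));
    }

    /** A string is its UTF-8 length (encoded like a long) followed by the UTF-8 bytes. */
    static int stringSize(String value) {
        int utf8Length = value.getBytes(StandardCharsets.UTF_8).length;
        return varintLength(zigZag(utf8Length)) + utf8Length;
    }

    /** One FacebookUser record: the four fields are simply concatenated. */
    static int facebookUserSize(String name, int numLikes, int numPhotos, int numGroups) {
        return stringSize(name) + intSize(numLikes) + intSize(numPhotos) + intSize(numGroups);
    }

    public static void main(String[] args) {
        // 11 bytes for the name, 1 + 1 bytes for the small ints, 2 bytes for 423
        System.out.println(facebookUserSize("Doctor Who", 1, 0, 423)); // prints 15
    }
}
```

The JSON form of the same record repeats every field name and writes numbers as decimal text, so its relative overhead is largest for records made of small integers and short strings, and shrinks as the string payload grows.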

In addition to using Avro for on-the-fly serialization as described above, with a statically typed language such as Java we often want to go for class generation.

Avro enables class generation from .avpr descriptions using the org.apache.avro.specific.SpecificCompiler class, either from the command line:

org.apache.avro.specific.SpecificCompiler [avpr file]

or from code by specifying source schema and output directory:

SpecificCompiler.compileSchema(new File("fb_user.avpr"), new File("src/avro/generated/"));

Classes generated in this manner implement the SpecificRecord interface, with three methods for accessing the data:

* getSchema() - returns the Schema object corresponding to the structure of the serialized data
* get(int i) - returns the Object holding the value of the field at the given position in the schema
* put(int i, Object v) - sets the value of the field at the given position in the schema

By leveraging the obtained Schema data, we can easily determine the appropriate field indexes and retrieve the desired data from serialized objects.
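The name-to-position lookup can be illustrated without Avro on the classpath. The sketch below is a hypothetical stand-in for a generated class: it mimics only the get(int) / put(int, Object) contract and keeps the field order of the FacebookUser schema in a plain list. None of the names in it come from Avro.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a generated record class; not an Avro type. */
final class PositionalRecordSketch {

    // Field order as declared in the FacebookUser schema.
    static final List<String> FIELD_ORDER =
            Arrays.asList("name", "num_likes", "num_photos", "num_groups");

    private final Object[] values = new Object[FIELD_ORDER.size()];

    /** Resolves a field name to its position; fails loudly for unknown names. */
    static int positionOf(String fieldName) {
        int pos = FIELD_ORDER.indexOf(fieldName);
        if (pos < 0) {
            throw new IllegalArgumentException("No such field: " + fieldName);
        }
        return pos;
    }

    /** Positional read, like get(int i) on a generated class. */
    Object get(int i) {
        return values[i];
    }

    /** Positional write, like put(int i, Object v) on a generated class. */
    void put(int i, Object v) {
        values[i] = v;
    }

    public static void main(String[] args) {
        PositionalRecordSketch user = new PositionalRecordSketch();
        user.put(positionOf("num_groups"), 423);
        System.out.println(user.get(positionOf("num_groups"))); // prints 423
    }
}
```

With real generated classes the position would be taken from the Schema object rather than from a hand-written list; recent Avro releases expose it, as far as I know, through Schema.getField(name).pos().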

A convenient side effect of storing the schema alongside the serialized data is that it vastly simplifies versioning of the data format. (Note that the schema travels with the data only in Avro's container file format; a bare JsonEncoder or BinaryEncoder stream holds just the values.) Namely, when processing a historical data collection, we can detect a format change by comparing Schema objects and use them to resolve any collisions that might arise:

    Schema s = Schema.parse(new File("src/data/avro/sample/fb_user.avpr"));
    GenericDatumReader<GenericRecord> r = new GenericDatumReader<GenericRecord>(s);
    Decoder decoder = new JsonDecoder(s, new FileInputStream(new File("test_data_json.avro")));
    // read() already returns a GenericRecord, so no cast is needed
    GenericRecord rec = r.read(null, decoder);
    // Compare the schema we expect with the one attached to the record
    if (s.equals(rec.getSchema())) {
        // handle regular fields
    } else {
        // handle differences
    }
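What "handle differences" might involve can be sketched without Avro. The example below models each schema version as a map from field name to type name, which is a simplifying assumption (real Schema objects carry much more), and reports which fields were added or changed type between the version the data was written with and the current one. The class, the method names and the sample changes are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Library-free sketch: a schema version is modeled as field name -> type name. */
final class SchemaDiffSketch {

    /** Fields that exist in the current version but not in the historical one. */
    static Set<String> addedFields(Map<String, String> historical, Map<String, String> current) {
        Set<String> added = new TreeSet<String>(current.keySet());
        added.removeAll(historical.keySet());
        return added;
    }

    /** Fields present in both versions whose declared type differs. */
    static Set<String> retypedFields(Map<String, String> historical, Map<String, String> current) {
        Set<String> retyped = new TreeSet<String>();
        for (Map.Entry<String, String> e : historical.entrySet()) {
            String currentType = current.get(e.getKey());
            if (currentType != null && !currentType.equals(e.getValue())) {
                retyped.add(e.getKey());
            }
        }
        return retyped;
    }

    public static void main(String[] args) {
        Map<String, String> v1 = new LinkedHashMap<String, String>();
        v1.put("name", "string");
        v1.put("num_likes", "int");

        Map<String, String> v2 = new LinkedHashMap<String, String>(v1);
        v2.put("num_likes", "long");  // hypothetical type change
        v2.put("num_photos", "int");  // hypothetical new field

        System.out.println(addedFields(v1, v2));   // prints [num_photos]
        System.out.println(retypedFields(v1, v2)); // prints [num_likes]
    }
}
```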

In addition to simple schemas such as the one in this example, the Avro specification lets us define far more complex types. For example, a model more suitable for describing graph data might take the following form:

{
      "namespace": "test.avro",
      "name": "FacebookUser",
      "type": "record",
      "fields": [
          {"name": "name", "type": "string"},
          {"name": "num_likes", "type": "int"},
          {"name": "num_photos", "type": "int"},
          {"name": "num_groups", "type": "int"},
          {"name": "friends", "type": {"type": "array", "items": "FacebookUser"}} ]
}

A common pitfall when describing large schemas is not accounting for possibly unknown field values. Attempting to serialize an object in which not all Utf8 fields are set will result in a NullPointerException:

java.lang.NullPointerException
	at org.apache.avro.io.JsonEncoder.writeString(JsonEncoder.java:117)
	at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:176)
	at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:171)
	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72)

To mitigate this, we need to indicate in the schema that it’s valid for certain fields not to have a value set (if this is indeed the case). We do this by declaring such fields as a union that includes null. A version of the example schema that allows the “name” field to be null takes the following form:

{
      "namespace": "test.avro",
      "name": "FacebookUser",
      "type": "record",
      "fields": [
          {"name": "name", "type": ["string", "null"] },
          {"name": "num_likes", "type": "int"},
          {"name": "num_photos", "type": "int"},
          {"name": "num_groups", "type": "int"} ]
}
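It may help to see what such a union costs on the wire. According to the Avro specification, the binary encoding of a union writes the zero-based position of the chosen branch (encoded like a long) and then the value itself, with null occupying zero bytes. The sketch below reproduces that by hand for the ["string", "null"] union, without using Avro; the class and method names are invented for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Library-free sketch of the binary encoding of the union ["string", "null"]. */
final class NullableStringEncodingSketch {

    // Branch positions follow the order in which the union lists its types.
    static final int STRING_BRANCH = 0;
    static final int NULL_BRANCH = 1;

    /** Writes a long as a zig-zag variable-length integer. */
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // low 7 bits plus continuation flag
            z >>>= 7;
        }
        out.write((int) z);
    }

    static byte[] encodeNullableString(String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value == null) {
            writeLong(out, NULL_BRANCH); // the null value itself adds no bytes
        } else {
            byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
            writeLong(out, STRING_BRANCH);
            writeLong(out, utf8.length);
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encodeNullableString(null).length);         // prints 1
        System.out.println(encodeNullableString("Doctor Who").length); // prints 12
    }
}
```

So making a field nullable costs one extra byte per value in the binary format. In the JSON encoding, the specification wraps a non-null union value in an object keyed by its type name, for example {"string": "Doctor Who"}, while null is written as plain null.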

Another beautiful side effect of the Avro schema format is that all attributes in the schema with non-keyword names are ignored by the compiler:

{
      "namespace": "test.avro",
      "name": "FacebookUser",
      "type": "record",
      "fields": [
          {"name": "name", "type": ["string", "null"], "format": "name/surname" },
          {"name": "num_likes", "type": "int", "min" : 3},
          {"name": "num_photos", "type": "int", "avg" : 12},
          {"name": "num_groups", "type": "int", "max" : 9 } ]
}

This enables us to (ab)use this information as metadata in a number of ways: from extending Avro into a general “data modeling” language, through describing interdependencies between various objects in complex systems, to serving as a sort of “annotation” that can support “high-order” data processing. However, these “metadata” attributes are not stored in the schema serialized along with the data, which limits their “runtime” potential, for example as “annotations” for resolving complex versioning or data migration issues.

(a sequel post will follow)

battery etc.

Saturday, May 1st, 2010
