.data
April 1st, 2012 by Aleksandar Bradic


With the broader availability of HTTP server push APIs such as the Twitter Streaming API, creating data streams and piping them to regular Unix processes has become trivial, which opens up opportunities for playful interactions with the rest of the Unix toolbox. Tools such as sed and awk are particularly well suited for this task, and they have been part of the standard distribution since Unix Version 7 (1979). What is more, these tools were actually designed for stream-based processing (though with different “streams” in mind), so it is interesting to explore what good they can still do for us in 2011+.
We give some simple examples of operations on data streams using sed in combination with the Twitter Spritzer Feed :
1. Get text of all tweets in the stream :
curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*"text":\("[^"]*"\).*/\1/p'
"@isay_dayo chillin.. juss chillin"
"@BoeBoeThoe hahhaha followed by lil wayne -get too comfortable"
"On to the next ...."
2. Get verbose print of all tweets in the stream:
curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*"retweet_count":\([^"]*\),.*"text":\("[^"]*"\).*"created_at":"\([^"]*\)".*"screen_name":"\([^"]*\)".*"time_zone":"\([^"]*\)".*/\3 | \4 (\5) | \2 (\1 retweets)/p'
Wed Jun 24 08:29:08 +0000 2009 agirprlaplanete (Paris) | "Fukushima : contamination marine et silence du gouvernement http:\/\/t.co\/rcTwdXk" (4 retweets)
Thu Apr 16 18:06:15 +0000 2009 rjmoeller (Central Time (US & Canada)) | "This is funny, I don't care where you're from: http:\/\/t.co\/6yo6ACj" (1 retweets)
3. Get US-only tweets with retweet count > 5 (the \| alternation assumes GNU sed) :
curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*"retweet_count":\([6-9]\|[1-9][0-9][0-9]*\),.*"text":\("[^"]*"\).*"created_at":"\([^"]*\)".*"screen_name":"\([^"]*\)".*"time_zone":"\([^"]*\) (US & Canada)[^"]*".*/\3 | \4 (\5) | \2 (\1 retweets)/p'
Thu Mar 15 09:26:10 +0000 2007 | sfslim (Pacific Time) | "Lesson learned? <90 people can paralyze a city transit system merely by leveraging the reputation of Anonymous. Fascinating\u2026 #PsyOps #OpBART" (75 retweets)
Wed Jun 22 01:09:25 +0000 2011 | _BlackStewie (Eastern Time) | "A Wise Hoodrat once said.. \" (66 retweets)
4. Get all the http links appearing in tweets :
curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*"text":\("[^"]*http:\\[\/]*\([^"]*\)\\\/\([^" ]*\)"\).*/http:\/\/\2\/\3/p'
http://t.co/dvtPgcG
http://de.tk/0ijcS
http://t.co/M4mKFV8
http://t.co/KftSofI
5. Get all the hashtags in tweets :
curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*"text":\("[^"]*\(#[^" ]*\)"\).*/\2/p'
#ojkbot
#worstfeeling
#raganswa
#AngerOnAuto
#Nostalgia
(note that these are just ad hoc ideas and not tested in great detail)
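Since the streaming endpoint needs credentials, one way to try these expressions offline is to feed sed a hand-written line. The following is only a minimal sketch: the sample record is invented for illustration and is far shorter than a real status object, and extract_text and extract_hashtag are hypothetical helper names wrapping the expressions from examples 1 and 5.

```shell
#!/bin/sh
# Hypothetical sample record; real stream entries carry many more fields.
sample='{"retweet_count":0,"text":"On to the next #Nostalgia","id":1}'

# Example 1: print the quoted "text" value found on each input line.
extract_text() {
  sed -n 's/.*"text":\("[^"]*"\).*/\1/p'
}

# Example 5: print a hashtag that sits right before the closing quote
# of the "text" value.
extract_hashtag() {
  sed -n 's/.*"text":\("[^"]*\(#[^" ]*\)"\).*/\2/p'
}

printf '%s\n' "$sample" | extract_text
printf '%s\n' "$sample" | extract_hashtag
```

Trying it on a line like this also makes one limitation visible: the hashtag expression only matches a tag that ends the tweet text, so tags in the middle of a tweet are skipped.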
Now, once such sed-filtered streams are in place, we can hook them up to the rest of the standard Unix tools, and that’s where the real fun begins …
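As one possible illustration of hooking a filtered stream up to other tools, the sketch below counts how often each hashtag occurs. The name top_hashtags is a hypothetical helper, and the input is a handful of invented tags standing in for the output of example 5.

```shell
#!/bin/sh
# Count how often each input line occurs, most frequent first.
# Input: one hashtag per line (e.g. the output of example 5).
top_hashtags() {
  sort | uniq -c | sort -rn | awk '{print $2, $1}'
}

# Stand-in for the live stream: a few invented hashtags.
printf '%s\n' '#Nostalgia' '#ojkbot' '#Nostalgia' \
  '#worstfeeling' '#Nostalgia' '#ojkbot' | top_hashtags
```

Note that sort has to see the end of its input before it prints anything, so on a live stream the input would first have to be bounded, for example with head -n.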
A piece of concept design for something we have been playing with as a part of our (upcoming) Sprawl Voice project :

stay tuned …

Apache Avro is an important entry in the expanding set of serialization systems (Thrift, Protobuf, Etch..). What might make it appealing at first sight is its all-JSON focus: JSON is both the format of choice for schema definition and an optional format for data serialization (in addition to the binary format). Those interested in the benefits of such a format (human-readable, line-serializable, standard, easy to integrate) might be sold on this aspect alone.
However, getting up to speed with Avro for simple local serialization might not be as straightforward, mostly due to the lack of examples. We give an example of using Avro with Java for simple local serialization and discuss some potential pitfalls, using the trivial case of serializing to disk the social graph dataset mentioned in the previous post.
In order to get started on building your Java projects with Avro support, you need to either obtain the following jars from the official Avro release page: avro-1.3.1.jar, jackson-mapper-asl.jar, jackson-core-asl.jar, or (if you’re using Maven) add the following artifact to your project:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>avro</artifactId>
<version>1.3.1</version>
<scope>compile</scope>
</dependency>
Once Avro support is in place, we can start by describing the given data format using a simple Avro schema:
{
"namespace": "test.avro",
"name": "FacebookUser",
"type": "record",
"fields": [
{"name": "name", "type": "string"},
{"name": "num_likes", "type": "int"},
{"name": "num_photos", "type": "int"},
{"name": "num_groups", "type": "int"} ]
}
This schema should be sufficient for simple file format disk serialization (no RPC details).
A convenient feature of Avro is that it enables direct serialization from schema without code generation. We can easily perform JSON-serialization of data defined by schema above using the following code snippet:
String schemaDescription =
" { \n" +
" \"name\": \"FacebookUser\", \n" +
" \"type\": \"record\",\n" +
" \"fields\": [\n" +
" {\"name\": \"name\", \"type\": \"string\"},\n" +
" {\"name\": \"num_likes\", \"type\": \"int\"},\n" +
" {\"name\": \"num_photos\", \"type\": \"int\"},\n" +
" {\"name\": \"num_groups\", \"type\": \"int\"} ]\n" +
"}";
// Parse the schema and set up a generic (schema-driven) writer
Schema s = Schema.parse(schemaDescription);
ByteArrayOutputStream bao = new ByteArrayOutputStream();
GenericDatumWriter<GenericRecord> w = new GenericDatumWriter<GenericRecord>(s);
Encoder e = new JsonEncoder(s, bao);
// Redirect the encoder output from the in-memory buffer to a file
e.init(new FileOutputStream(new File("test_data.avro")));
// Populate a record by field name and serialize it
GenericRecord r = new GenericData.Record(s);
r.put("name", new org.apache.avro.util.Utf8("Doctor Who"));
r.put("num_likes", 1);
r.put("num_photos", 0);
r.put("num_groups", 423);
w.write(r, e);
e.flush();
Of course, adding the schema directly to the code does not look particularly attractive, so the preferred approach is writing the schema to a separate config file and using:
Schema s = Schema.parse(new File("schema_path/fb_user.avpr"));
Additionally, if we want to use binary instead of JSON serialization, we simply have to change the Encoder implementation. For the binary encoder, that is:
Encoder e = new BinaryEncoder(bao);
In practice, JSON serialization can be used for debugging purposes, when data volume is low, or when we simply want to (ab)use Avro as a general JSON-serialization layer. For large-volume data processing and archival, however, the binary format is the preferred option, because JSON serialization adds size overhead. This overhead varies with the actual data values being serialized. The following graph illustrates this for the trivial data format given in this example, for various lengths of string and integer elements using JSON and binary encoding (uncompressed) :

In addition to using Avro for on-the-fly serialization as described above, with a statically typed language such as Java we often want to go for class generation.
Avro enables class generation from .avpr descriptions using the org.apache.avro.specific.SpecificCompiler class, either from the command line as:
org.apache.avro.specific.SpecificCompiler [avpr file]
or from code by specifying source schema and output directory:
SpecificCompiler.compileSchema(new File("fb_user.avpr"), new File("src/avro/generated/"));
Classes generated in this manner implement the SpecificRecord interface, with three accessor methods for interfacing with the data :
* getSchema() - returning Schema object corresponding to structure of serialized data
* get(int i) - returning Object corresponding to the value of field at given position in schema
* put(int i, Object v) - allowing for setting the value of field at given position in the schema
By leveraging the obtained Schema data, we can easily determine the appropriate field indexes and retrieve the desired data from serialized objects.
A convenient side effect of storing the schema alongside the serialized data is that it vastly simplifies handling of data format versioning. Namely, when processing a historical data collection, we can detect a format change by comparing Schema objects, and use them to resolve any collisions that might arise:
Schema s = Schema.parse(new File("src/data/avro/sample/fb_user.avpr"));
GenericDatumReader<GenericRecord> r = new GenericDatumReader<GenericRecord>(s);
Decoder decoder = new JsonDecoder(s, new FileInputStream(new File("test_data_json.avro")));
GenericRecord rec = (GenericRecord)r.read(null, decoder);
if (s.equals(rec.getSchema())) {
// handle regular fields
} else {
// handle differences
}
In addition to describing simple schemas such as the one in this example, the Avro specification enables us to define far more complex types. For example, a model more suitable for graph data description might take the following form:
{
"namespace": "test.avro",
"name": "FacebookUser",
"type": "record",
"fields": [
{"name": "name", "type": "string"},
{"name": "num_likes", "type": "int"},
{"name": "num_photos", "type": "int"},
{"name": "num_groups", "type": "int"},
{"name": "friends", "type": {"type": "array", "items": "FacebookUser"}} ]
}
A common pitfall when describing large schemas is not accounting for possibly unknown field values. Attempting to serialize an object in which not all Utf8 fields are set will result in a null pointer exception:
java.lang.NullPointerException
    at org.apache.avro.io.JsonEncoder.writeString(JsonEncoder.java:117)
    at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:176)
    at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:171)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72)
In order to mitigate this, we need to indicate in the schema that it is valid for certain fields of an object not to have a value set (if this is indeed the case). We do this by declaring such fields in the schema as having an optional null value. The schema from the example, changed to allow the “name” field to be null, takes the following form:
{
"namespace": "test.avro",
"name": "FacebookUser",
"type": "record",
"fields": [
{"name": "name", "type": ["string", "null"] },
{"name": "num_likes", "type": "int"},
{"name": "num_photos", "type": "int"},
{"name": "num_groups", "type": "int"} ]
}
Another beautiful side effect of the Avro schema format is that all attributes in the schema which have non-keyword names are ignored by the compiler:
{
"namespace": "test.avro",
"name": "FacebookUser",
"type": "record",
"fields": [
{"name": "name", "type": ["string", "null"], "format" : "name/surname" },
{"name": "num_likes", "type": "int", "min" : 3},
{"name": "num_photos", "type": "int", "avg" : 12},
{"name": "num_groups", "type": "int", "max" : 9 } ]
}
This enables us to (ab)use this information as metadata in a number of ways: from extending Avro into a general “data modeling” language, through describing interdependencies between various objects in complex systems, to using it as a sort of “annotation” that can support “high-order” data processing. However, these “metadata” attributes are not stored in the schema serialized along with the data, which limits their “runtime” potential, for example as “annotations” for resolving complex versioning/data migration issues.
(a sequel post will follow)

The recently introduced Facebook Graph API represents an interesting source of data with a nice, easy-to-appreciate context to it (everyone loves social). In order to motivate some of the examples in the blog, I have written up a simple quick&dirty Graph API client in Java :
that provides a trivial-to-use interface for graph data processing :
SimpleGraphAPIClient client = new SimpleGraphAPIClient(fbToken);
LinkedList<FacebookUser> friends = client.getFriends();
for (FacebookUser friend : friends) {
LinkedList<LikedEntry> likes = client.getLikes(friend.getID());
LinkedList<PhotoEntry> photos = client.getPhotos(friend.getID());
LinkedList<GroupEntry> groups = client.getGroups(friend.getID());
// arbitrary dataset creation logic
}
From a pure tool perspective (without actually having an active fb app), the dataset that can be generated is quite limited (bounded to the “neighborhood” of a single user), but even with that, a lot of interesting “play” data can be derived. For example, at minimum, we can get (num_likes, num_photos, num_groups) data for all “friend” users, and to that we can add some “derived” metrics like average group size, photo age, etc. Modeling this data alone can motivate some very interesting problems.
Here is a simple plot of (num_likes, num_photos, num_groups) dataset of 170 anonymous users obtained in this manner:

(Note - some of the data that the Graph API returns occasionally doesn’t match the actual state on the site, so some outliers might be just missing data on the fb side. However, this systematic bias is what might make the dataset especially interesting.)

Nonparametric regression aims at modeling the relation between predictors and the dependent variable without any assumptions on the specific form of the dependency function:
Unlike classical linear regression, where the goal is determining the parameters of an assumed linear function, in nonparametric regression the goal is estimating the entire regression function directly. Depending on the assumptions about the structure of the underlying data, a number of methods exist that achieve optimal estimation. We give an overview of several methods and explain their practical usage in R. In doing so, we make use of the social graph data described in a recent post.
Local Regression
The LOWESS (Locally Weighted Scatterplot Smoothing) algorithm is based on the idea of local linear regression. The general approach of local regression is fitting simple models to “local” subsets of data and combining the results to determine the regression function for the entire dataset. In this method, for modelling “local” data we use a weighted least squares polynomial fit of the general form :
where the “local” observations are weighted by their proximity to “focal” value
..

> plot(num_groups, num_likes)
> lines(lowess(num_groups, num_likes, f = 2/3, iter = 4), col = 2)
The effect of span window for (f=1/16, f=1/8, f=1/4, f=1/2 ) :

(in progress…)
As time moves on, we’re finally shaping up to move towards v0.1 of the framework. Some of the essential features of this release include :
Years of open source software development, aimed at perfecting the engineering craft and providing alternatives and resources for new developers to learn, have been a successful mission, yet they have failed to produce the side effects that some of us hoped for. The hope was that open software would enable the many to express their creativity and start creating new qualities of software, exploring new territories, and moving the experience of building software from a craft to something more closely related to an art.
If we observe a similar process in music over the last century, this is exactly what happened: by making the tools for creating music available to many, it became possible for many to create new qualities, regardless of their skillset, as long as they could formulate a long-lasting “art” aspect of it. A lot of indie music today is aimed at creating something “new” rather than creating something “perfect”. Even though a lot of this ends up in the dumpster, the long-term effect is an explosion of the field, in that more and more “new territories” are being explored, and every once in a while a flag is raised. This makes all of our lives richer and helps us advance, which was the meaning of it all.
Naturally, a call for a similar initiative in software is needed: a call for creating software aimed at something “new” rather than something “better” or “cheaper”, a call for creating something with art-like qualities to it, with a sense of individual expression and exploration.
… to be continued