<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>

<channel>
	<title>void:search* blog</title>
	<atom:link href="http://blog.voidsearch.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.voidsearch.com</link>
	<description>information visualization and search :art collective</description>
	<pubDate>Tue, 03 Apr 2012 00:40:53 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>.data</title>
		<link>http://blog.voidsearch.com/uncategorized/recycling-data/</link>
		<comments>http://blog.voidsearch.com/uncategorized/recycling-data/#comments</comments>
		<pubDate>Mon, 02 Apr 2012 00:35:58 +0000</pubDate>
		<dc:creator>Aleksandar Bradic</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.voidsearch.com/?p=1006</guid>
		<description><![CDATA[&#160;
&#160;

&#160;
&#160;
]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;">&nbsp;</p>
<p style="text-align: center;">&nbsp;</p>
<p style="text-align: center;"><img class="aligncenter" title=":recycle" src="http://30.media.tumblr.com/tumblr_lznztibvUN1rneeclo1_500.png" alt="" width="500" height="440" /></p>
<p style="text-align: center;">&nbsp;</p>
<p style="text-align: center;">&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.voidsearch.com/uncategorized/recycling-data/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Simple Twitter Stream Processing using Sed</title>
		<link>http://blog.voidsearch.com/stream-processing/simple-twitter-stream-processing-using-sed/</link>
		<comments>http://blog.voidsearch.com/stream-processing/simple-twitter-stream-processing-using-sed/#comments</comments>
		<pubDate>Fri, 19 Aug 2011 00:26:32 +0000</pubDate>
		<dc:creator>Aleksandar Bradic</dc:creator>
		
		<category><![CDATA[stream processing]]></category>

		<guid isPermaLink="false">http://blog.voidsearch.com/?p=653</guid>
		<description><![CDATA[With the broader availability of HTTP Server Push APIs such as the Twitter Streaming API, creating data streams and piping them to regular Unix processes has become trivial, which provides opportunities for playful interactions with the rest of the Unix toolbox. Tools such as sed &#38; awk are particularly well suited for this task and they have [...]]]></description>
			<content:encoded><![CDATA[<p>With the broader availability of HTTP Server Push APIs such as the <a href="https://dev.twitter.com/docs/streaming-api">Twitter Streaming API</a>, creating data streams and piping them to regular Unix processes has become trivial, which provides opportunities for playful interactions with the rest of the Unix toolbox. Tools such as <strong>sed </strong>&amp; <strong>awk </strong>are particularly well suited for this task, and they have been part of the standard distribution since Unix Version 7 (1979). What is more, these tools were actually designed for stream-based processing (though with different &#8220;streams&#8221; in mind), so it is interesting to explore what good they can still do for us in 2011+ <img src='http://blog.voidsearch.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>We give some simple examples of operations on data streams using sed in combination with the <a href="http://gnip.com/twitter/spritzer">Twitter Spritzer Feed</a> :</p>
<p><strong>1.</strong><em> Get text of all tweets in the stream </em>:</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*&quot;text&quot;:\(&quot;[^&quot;]*&quot;\).*/\1/p'</pre></div></div>


<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">&quot;@isay_dayo chillin.. juss chillin&quot;
&quot;@BoeBoeThoe hahhaha followed by lil wayne -get too comfortable&quot;
&quot;On to the next ....&quot;</pre></div></div>
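
<p>The expression can also be checked without a live stream. The following is a minimal sketch that feeds one hand-written JSON line through the same sed expression; the sample payload and the function name <em>extract_text</em> are made up for illustration, and real stream messages carry many more fields.</p>

```shell
# Hypothetical one-line tweet payload (not real stream output).
sample='{"retweet_count":0,"text":"On to the next ....","user":{"screen_name":"someone"}}'

# Same expression as in example 1: print the quoted value of the "text" field.
extract_text() {
  sed -n 's/.*"text":\("[^"]*"\).*/\1/p'
}

printf '%s\n' "$sample" | extract_text
```

<p>Because the pattern stops at the first double quote it meets, a tweet that itself contains an escaped quote will come out truncated.</p>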

<p><strong>2.</strong><em> Get verbose print of all tweets in the stream: </em></p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*&quot;retweet_count&quot;:\([^&quot;]*\),.*&quot;text&quot;:\(&quot;[^&quot;]*&quot;\).*&quot;created_at&quot;:&quot;\([^&quot;]*\)&quot;.*&quot;screen_name&quot;:&quot;\([^&quot;]*\)&quot;.*&quot;time_zone&quot;:&quot;\([^&quot;]*\)&quot;.*/\3 | \4 (\5) | \2 (\1 retweets)/p'</pre></div></div>


<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">Wed Jun 24 08:29:08 +0000 2009 | agirprlaplanete (Paris) | &quot;Fukushima : contamination marine et silence du gouvernement http:\/\/t.co\/rcTwdXk&quot; (4 retweets)
Thu Apr 16 18:06:15 +0000 2009 | rjmoeller (Central Time (US &amp; Canada)) | &quot;This is funny, I don't care where you're from: http:\/\/t.co\/6yo6ACj&quot; (1 retweets)</pre></div></div>

<p><strong>3.</strong><em> Get US-only tweets with retweet count &gt; 5 </em>:</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n -e '/&quot;retweet_count&quot;:[0-5],/d' -e 's/.*&quot;retweet_count&quot;:\([0-9][0-9]*\),.*&quot;text&quot;:\(&quot;[^&quot;]*&quot;\).*&quot;created_at&quot;:&quot;\([^&quot;]*\)&quot;.*&quot;screen_name&quot;:&quot;\([^&quot;]*\)&quot;.*&quot;time_zone&quot;:&quot;\([^&quot;]*\) (US &amp; Canada)[^&quot;]*&quot;.*/\3 | \4 (\5) | \2 (\1 retweets)/p'</pre></div></div>


<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">Thu Mar 15 09:26:10 +0000 2007 | sfslim (Pacific Time) | &quot;Lesson learned? &amp;lt;90 people can 
paralyze a city transit system merely by leveraging the reputation of Anonymous. 
Fascinating\u2026 #PsyOps #OpBART&quot; (75 retweets)
Wed Jun 22 01:09:25 +0000 2011 | _BlackStewie (Eastern Time) | &quot;A Wise Hoodrat once said.. \&quot;
(66 retweets)</pre></div></div>
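
<p>Numeric thresholds are easy to get subtly wrong with character classes alone: a class such as <em>[5-9][0-9]*</em>, for instance, accepts 5 and rejects 10 through 49. One portable way around this, sketched below under the assumption that the count is always a plain non-negative integer followed by a comma, is to first delete lines whose count is a single digit from 0 to 5 and then print the rest. The sample lines and the function name <em>more_than_five</em> are hypothetical.</p>

```shell
# Hypothetical, heavily trimmed stream lines with retweet counts 5, 6, 12 and 75.
samples='{"retweet_count":5,"text":"five"}
{"retweet_count":6,"text":"six"}
{"retweet_count":12,"text":"twelve"}
{"retweet_count":75,"text":"seventy-five"}'

# Drop lines whose count is a single digit 0-5, then print "count text" for the rest.
more_than_five() {
  sed -n -e '/"retweet_count":[0-5],/d' \
         -e 's/.*"retweet_count":\([0-9][0-9]*\),.*"text":\("[^"]*"\).*/\1 \2/p'
}

printf '%s\n' "$samples" | more_than_five
```

<p>Only the lines with counts 6, 12 and 75 are printed. Note that a message containing several retweet_count fields would be dropped if any one of them is in the 0 to 5 range.</p>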

<p><strong>4</strong>. Get all the http links appearing in tweets :</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
|sed -n 's/.*&quot;text&quot;:\(&quot;[^&quot;]*http:\\[\/]*\([^&quot;]*\)\\\/\([^&quot; ]*\)&quot;\).*/http:\/\/\2\/\3/p'</pre></div></div>


<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">http://t.co/dvtPgcG
http://de.tk/0ijcS
http://t.co/M4mKFV8
http://t.co/KftSofI</pre></div></div>

<p><strong>5</strong>. Get all the hashtags in tweets :</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*&quot;text&quot;:\(&quot;[^&quot;]*\(#[^&quot; ]*\)&quot;\).*/\2/p'</pre></div></div>


<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">#ojkbot
#worstfeeling
#raganswa
#AngerOnAuto
#Nostalgia</pre></div></div>
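
<p>The expression in example 5 prints at most one hashtag per tweet, and only when that hashtag is the last token of the text. A rough sketch of a variant that prints every hashtag is given below. It assumes a grep that supports the <em>-o</em> option (GNU and BSD grep both do, although it is not required by POSIX) and treats a hashtag as <em>#</em> followed by letters, digits or underscores; the sample line and the function name <em>all_hashtags</em> are hypothetical.</p>

```shell
# Hypothetical trimmed tweet with two hashtags, neither at the end of the text.
sample='{"text":"#Nostalgia is not the #worstfeeling after all","id":1}'

# Isolate the value of the "text" field, then print each hashtag on its own line.
all_hashtags() {
  sed -n 's/.*"text":"\([^"]*\)".*/\1/p' | grep -o '#[A-Za-z0-9_][A-Za-z0-9_]*'
}

printf '%s\n' "$sample" | all_hashtags
```

<p>The original single-expression version prints nothing for this sample, since the text does not end in a hashtag.</p>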

<p><em>(note that these are just ad hoc ideas and not tested in great detail)</em></p>
<p>Now, once we complete creation of such Sed-filtered streams, we can hook these up to the rest of standard Unix tools, and that&#8217;s where the real fun begins &#8230; <img src='http://blog.voidsearch.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
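
<p>As one small sketch of such a hookup, the classic sort and uniq pair can rank hashtags by frequency. Since sort has to see the end of its input before printing anything, on a live stream this only works on a bounded sample (for instance the first few thousand lines taken with head). The input below is a fixed, made-up list standing in for the output of example 5, and <em>rank_by_count</em> is a hypothetical helper name.</p>

```shell
# Rank input lines by frequency, most frequent first; prints "token count".
rank_by_count() {
  sort | uniq -c | sort -rn | awk '{print $2, $1}'
}

# Made-up hashtags standing in for a bounded sample of the filtered stream.
printf '%s\n' '#Nostalgia' '#ojkbot' '#Nostalgia' '#worstfeeling' '#Nostalgia' '#ojkbot' \
| rank_by_count
```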
]]></content:encoded>
			<wfw:commentRss>http://blog.voidsearch.com/stream-processing/simple-twitter-stream-processing-using-sed/feed/</wfw:commentRss>
		</item>
		<item>
		<title>news::visualized &#124; take #02</title>
		<link>http://blog.voidsearch.com/infoviz/newsvisualized-take-02/</link>
		<comments>http://blog.voidsearch.com/infoviz/newsvisualized-take-02/#comments</comments>
		<pubDate>Mon, 15 Aug 2011 22:11:04 +0000</pubDate>
		<dc:creator>Aleksandar Bradic</dc:creator>
		
		<category><![CDATA[infoviz]]></category>

		<guid isPermaLink="false">http://blog.voidsearch.com/?p=832</guid>
		<description><![CDATA[A bit more of Sprawl Voice concept design :
.

.
]]></description>
			<content:encoded><![CDATA[<p>A bit more of <a href="http://labs.voidsearch.com/sprawl/">Sprawl Voice</a> concept design :</p>
<p>.</p>
<p><img class="aligncenter size-full wp-image-833" title="newswire_02_resized" src="http://blog.voidsearch.com/wp-content/uploads/2011/08/newswire_02_resized.png" alt="newswire_02_resized" width="501" height="640" /></p>
<p>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.voidsearch.com/infoviz/newsvisualized-take-02/feed/</wfw:commentRss>
		</item>
		<item>
		<title>news::visualized &#124; take #01</title>
		<link>http://blog.voidsearch.com/infoviz/newsvisualized-take-01/</link>
		<comments>http://blog.voidsearch.com/infoviz/newsvisualized-take-01/#comments</comments>
		<pubDate>Sun, 14 Aug 2011 01:48:35 +0000</pubDate>
		<dc:creator>Aleksandar Bradic</dc:creator>
		
		<category><![CDATA[infoviz]]></category>

		<guid isPermaLink="false">http://blog.voidsearch.com/?p=656</guid>
		<description><![CDATA[A piece of concept design for something we have been playing with as a part of our (upcoming) Sprawl Voice project :

stay tuned &#8230;
]]></description>
			<content:encoded><![CDATA[<p>A piece of concept design for something we have been playing with as a part of our (upcoming) <a href="http://labs.voidsearch.com/sprawl/">Sprawl Voice</a> project :</p>
<p><img class="aligncenter size-full wp-image-657" title="newswire_infographics1_resized2" src="http://blog.voidsearch.com/wp-content/uploads/2011/08/newswire_infographics1_resized2.png" alt="newswire_infographics1_resized2" width="725" height="560" /></p>
<p>stay tuned &#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.voidsearch.com/infoviz/newsvisualized-take-01/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Apache Avro in practice</title>
		<link>http://blog.voidsearch.com/bigdata/apache-avro-in-practice/</link>
		<comments>http://blog.voidsearch.com/bigdata/apache-avro-in-practice/#comments</comments>
		<pubDate>Mon, 03 May 2010 04:00:54 +0000</pubDate>
		<dc:creator>Aleksandar Bradic</dc:creator>
		
		<category><![CDATA[bigdata]]></category>

		<guid isPermaLink="false">http://blog.voidsearch.com/?p=110</guid>
		<description><![CDATA[
Apache Avro represents an important entry in the expanding set of serialization systems (Thrift, Protobuf, Etch..). What might make it appealing to an eye at first sight is its all-JSON focus. JSON is both a format-of-choice for schema definition and optional format for data serialization (in addition to the binary format). Those interested in benefits [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: left;"><img class="alignnone size-full wp-image-107" title="screen-shot-2010-04-24-at-11659-pm" src="http://blog.voidsearch.com/wp-content/uploads/2010/04/screen-shot-2010-04-24-at-11659-pm.png" alt="screen-shot-2010-04-24-at-11659-pm" width="540" height="82" /></p>
<p><strong><a href="http://hadoop.apache.org/avro/">Apache Avro</a></strong> represents an important entry in the expanding set of serialization systems (<em>Thrift, Protobuf, Etch..</em>). What makes it appealing at first sight is its all-JSON focus: JSON is both the format of choice for schema definition and an optional format for data serialization (in addition to the binary format). Those interested in the benefits of such a format (human-readable, line-serializable, standard, easy to integrate) might be sold on this aspect alone.</p>
<p>However, getting up to speed with Avro for simple local serialization might not be as straightforward, mostly due to the lack of examples. We give an example of using Avro with Java for simple local serialization and discuss some potential pitfalls. We consider a trivial example: serializing to disk the social graph dataset mentioned in a previous <a href="http://blog.voidsearch.com/statistics/sampling-the-social-graph-using-facebook-graph-api/">post</a>.</p>
<p>To get started building Java projects with Avro support, you need to either obtain the following jars: <em>avro-1.3.1.jar</em>, <em>jackson-mapper-asl.jar</em>, <em>jackson-core-asl.jar</em> from the official Avro <a href="http://hadoop.apache.org/avro/releases.html">release page</a> or (if you&#8217;re using Maven) add the following dependency to your project:</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">        &lt;dependency&gt;
            &lt;groupId&gt;org.apache.hadoop&lt;/groupId&gt;
            &lt;artifactId&gt;avro&lt;/artifactId&gt;
            &lt;version&gt;1.3.1&lt;/version&gt;
            &lt;scope&gt;compile&lt;/scope&gt;
        &lt;/dependency&gt;</pre></div></div>

<p>Once Avro support is in place, we can start by describing given <a href="http://blog.voidsearch.com/statistics/sampling-the-social-graph-using-facebook-graph-api/">data format</a> using simple Avro schema:</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;"> {
      &quot;namespace&quot;: &quot;test.avro&quot;,
      &quot;name&quot;: &quot;FacebookUser&quot;,
      &quot;type&quot;: &quot;record&quot;,
      &quot;fields&quot;: [
          {&quot;name&quot;: &quot;name&quot;, &quot;type&quot;: &quot;string&quot;},
          {&quot;name&quot;: &quot;num_likes&quot;, &quot;type&quot;: &quot;int&quot;},
          {&quot;name&quot;: &quot;num_photos&quot;, &quot;type&quot;: &quot;int&quot;},
          {&quot;name&quot;: &quot;num_groups&quot;, &quot;type&quot;: &quot;int&quot;} ]
}</pre></div></div>

<p>This schema should be sufficient for simple file format disk serialization (no RPC details).</p>
<p>A convenient feature of Avro is that it enables direct serialization from schema without code generation. We can easily perform JSON-serialization of data defined by schema above using the following code snippet:</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">    String schemaDescription =
            &quot; {    \n&quot; +
                    &quot; \&quot;name\&quot;: \&quot;FacebookUser\&quot;, \n&quot; +
                    &quot; \&quot;type\&quot;: \&quot;record\&quot;,\n&quot; +
                    &quot; \&quot;fields\&quot;: [\n&quot; +
                    &quot;   {\&quot;name\&quot;: \&quot;name\&quot;, \&quot;type\&quot;: \&quot;string\&quot;},\n&quot; +
                    &quot;   {\&quot;name\&quot;: \&quot;num_likes\&quot;, \&quot;type\&quot;: \&quot;int\&quot;},\n&quot; +
                    &quot;   {\&quot;name\&quot;: \&quot;num_photos\&quot;, \&quot;type\&quot;: \&quot;int\&quot;},\n&quot; +
                    &quot;   {\&quot;name\&quot;: \&quot;num_groups\&quot;, \&quot;type\&quot;: \&quot;int\&quot;} ]\n&quot; +
                    &quot;}&quot;;
&nbsp;
    Schema s = Schema.parse(schemaDescription);
&nbsp;
    ByteArrayOutputStream bao = new ByteArrayOutputStream();
    GenericDatumWriter w = new GenericDatumWriter(s);
    Encoder e = new JsonEncoder(s, bao);
    e.init(new FileOutputStream(new File(&quot;test_data.avro&quot;)));
&nbsp;
    GenericRecord r = new GenericData.Record(s);
    r.put(&quot;name&quot;, new org.apache.avro.util.Utf8(&quot;Doctor Who&quot;));
    r.put(&quot;num_likes&quot;, 1);
    r.put(&quot;num_photos&quot;, 0);
    r.put(&quot;num_groups&quot;, 423);
&nbsp;
    w.write(r, e);
    e.flush();</pre></div></div>

<p>Of course, adding schema directly to the code does not look particularly attractive, so the preferred use case is writing schema to separate config file and using:</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">Schema s = Schema.parse(new File(&quot;schema_path/fb_user.avpr&quot;));</pre></div></div>

<p>Additionally, in case we want to use binary, instead of JSON serialization, we simply have to change the <em>Encoder</em> implementation we will be using. In case of binary encoder, that is:</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">Encoder e = new BinaryEncoder(bao);</pre></div></div>

<p>In practice, JSON serialization can be used for debugging purposes, when data volume is low, or when we simply want to (ab)use Avro as a general JSON-serialization layer. For large-volume data processing and archival, however, the binary format is the preferred option, because JSON serialization adds size overhead: field names and punctuation are repeated in every record. How large this overhead is depends on the actual data values being serialized. The following graph illustrates this for the trivial data format given in this example, for various lengths of string and integer elements using JSON and binary encoding (uncompressed):</p>
<p style="text-align: left;"><img class="aligncenter size-full wp-image-423" title="avro_serialization1" src="http://blog.voidsearch.com/wp-content/uploads/2010/04/avro_serialization1.jpg" alt="avro_serialization1" width="516" height="509" /></p>
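<p>To make the difference concrete, here is a back-of-the-envelope sketch for the single record written in the example above. It assumes compact JSON output without whitespace and the binary encoding rules described in the Avro specification: ints are zig-zag mapped and written as variable-length integers with 7 bits per byte, strings are a length prefix followed by UTF-8 bytes, and record fields are concatenated without delimiters. The helper names <em>varint_len</em> and <em>string_len</em> are made up for illustration, and no Avro library is involved.</p>

```shell
# Bytes needed for one Avro int/long: zig-zag mapping, then 7 bits per byte.
varint_len() {
  n=$1
  if [ "$n" -ge 0 ]; then z=$((2 * n)); else z=$((-2 * n - 1)); fi
  len=1
  while [ "$z" -ge 128 ]; do
    z=$((z / 128))
    len=$((len + 1))
  done
  echo "$len"
}

# Bytes needed for one Avro string: length prefix plus the bytes (ASCII assumed).
string_len() {
  echo $(( $(varint_len "${#1}") + ${#1} ))
}

# The record from the example: name, num_likes, num_photos, num_groups.
json='{"name":"Doctor Who","num_likes":1,"num_photos":0,"num_groups":423}'
binary_size=$(( $(string_len "Doctor Who") + $(varint_len 1) + $(varint_len 0) + $(varint_len 423) ))

echo "json bytes:   ${#json}"
echo "binary bytes: $binary_size"
```

<p>Under these assumptions the record takes 67 bytes as JSON and 15 bytes in binary form. Most of the JSON size is field names, which is why the gap narrows as the string values get longer.</p>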
<p>In addition to using Avro for on-the-fly serialization as described above, with a statically typed language such as Java we often want to go for class generation. </p>
<p>Avro enables class generation from .avpr descriptions using <em>org.apache.avro.specific.SpecificCompiler</em> class, either from command line as:</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">org.apache.avro.specific.SpecificCompiler [avpr file]</pre></div></div>

<p>or from code by specifying source schema and output directory:</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">SpecificCompiler.compileSchema(new File(&quot;fb_user.avpr&quot;), new File(&quot;src/avro/generated/&quot;));</pre></div></div>

<p>Classes generated in this manner implement the <em>SpecificRecord</em> interface, which provides three methods for accessing the data:</p>
<p>* <strong>getSchema()</strong> - returning <em>Schema</em> object corresponding to structure of serialized data<br />
* <strong>get(int i)</strong> - returning <em>Object</em> corresponding to the value of field at given position in schema<br />
* <strong>put(int i, Object v)</strong> - allowing for setting the value of field at given position in the schema</p>
<p>By leveraging obtained <em>Schema</em> data - we can easily determine appropriate field indexes and retrieve desired data from serialized objects. </p>
<p>A convenient side effect of storing the schema alongside the serialized data is that it vastly simplifies versioning of the data format. Namely, when processing a historical data collection, we can detect a format change by comparing <em>Schema</em> objects and use them to resolve any conflicts that might arise:</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">    Schema s = Schema.parse(new File(&quot;src/data/avro/sample/fb_user.avpr&quot;));
    GenericDatumReader&lt;GenericRecord&gt; r = new GenericDatumReader&lt;GenericRecord&gt;(s);
    Decoder decoder = new JsonDecoder(s, new FileInputStream(new File(&quot;test_data_json.avro&quot;)));
    GenericRecord rec = (GenericRecord)r.read(null, decoder);
    if (s.equals(rec.getSchema())) {
        // handle regular fields
    } else {
        // handle differences
    }</pre></div></div>

<p>In addition to describing simple schemas such as the one in this example, Avro <a href="http://hadoop.apache.org/avro/docs/1.3.2/spec.html">specification</a> enables us to define far more complex types. For example, a model more suitable for graph data description might take the following form:</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">{
      &quot;namespace&quot;: &quot;test.avro&quot;,
      &quot;name&quot;: &quot;FacebookUser&quot;,
      &quot;type&quot;: &quot;record&quot;,
      &quot;fields&quot;: [
          {&quot;name&quot;: &quot;name&quot;, &quot;type&quot;: &quot;string&quot;},
          {&quot;name&quot;: &quot;num_likes&quot;, &quot;type&quot;: &quot;int&quot;},
          {&quot;name&quot;: &quot;num_photos&quot;, &quot;type&quot;: &quot;int&quot;},
          {&quot;name&quot;: &quot;num_groups&quot;, &quot;type&quot;: &quot;int&quot;},
          {&quot;name&quot;: &quot;friends&quot;, &quot;type&quot;: {&quot;type&quot;: &quot;array&quot;, &quot;items&quot;: &quot;FacebookUser&quot;}} ]
}</pre></div></div>

<p>A common pitfall when describing large schemas is not accounting for fields whose values may be unknown. Attempting to serialize an object in which some Utf8 (string) fields are not set results in a null pointer exception:</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">java.lang.NullPointerException
	at org.apache.avro.io.JsonEncoder.writeString(JsonEncoder.java:117)
	at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:176)
	at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:171)
	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72)</pre></div></div>

<p>In order to mitigate this, we need to indicate in the schema that it&#8217;s valid for certain fields in object not to have a value set (if this is indeed the case). We do this by declaring fields in schema as having optional <em>null</em> value. Schema from the example that allows for &#8220;name&#8221; field to have null value will take the following form:</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">{
      &quot;namespace&quot;: &quot;test.avro&quot;,
      &quot;name&quot;: &quot;FacebookUser&quot;,
      &quot;type&quot;: &quot;record&quot;,
      &quot;fields&quot;: [
          {&quot;name&quot;: &quot;name&quot;, &quot;type&quot;: [&quot;string&quot;, &quot;null&quot;] },
          {&quot;name&quot;: &quot;num_likes&quot;, &quot;type&quot;: &quot;int&quot;},
          {&quot;name&quot;: &quot;num_photos&quot;, &quot;type&quot;: &quot;int&quot;},
          {&quot;name&quot;: &quot;num_groups&quot;, &quot;type&quot;: &quot;int&quot;} ]
}</pre></div></div>

<p>Another beautiful side effect of the Avro schema format is that all attributes in the schema that have non-keyword names are ignored by the compiler:</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">{
      &quot;namespace&quot;: &quot;test.avro&quot;,
      &quot;name&quot;: &quot;FacebookUser&quot;,
      &quot;type&quot;: &quot;record&quot;,
      &quot;fields&quot;: [
          {&quot;name&quot;: &quot;name&quot;, &quot;type&quot;: [&quot;string&quot;,&quot;null&quot;], &quot;format&quot;  : &quot;name/surname&quot; },
          {&quot;name&quot;: &quot;num_likes&quot;, &quot;type&quot;: &quot;int&quot;, &quot;min&quot; : 3},
          {&quot;name&quot;: &quot;num_photos&quot;, &quot;type&quot;: &quot;int&quot;, &quot;avg&quot; : 12},
          {&quot;name&quot;: &quot;num_groups&quot;, &quot;type&quot;: &quot;int&quot;, &quot;max&quot; : 9 } ]
}</pre></div></div>

<p>This enables us to (ab)use this information as <em>metadata</em> in a number of ways - from extending avro to a general &#8220;data modeling&#8221; language, describing interdependencies between various objects in complex systems to being a sort of &#8220;annotation&#8221; that can support &#8220;high-order&#8221; data processing. However, these &#8220;metadata&#8221; attributes are not stored in the schema serialized along with the data, which limits their &#8220;runtime&#8221; potential - for example as &#8220;annotations&#8221; for resolving complex versioning/data migration issues. </p>
<p><em>(a sequel post will follow)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.voidsearch.com/bigdata/apache-avro-in-practice/feed/</wfw:commentRss>
		</item>
		<item>
		<title>battery etc.</title>
		<link>http://blog.voidsearch.com/art/battery-etc/</link>
		<comments>http://blog.voidsearch.com/art/battery-etc/#comments</comments>
		<pubDate>Sat, 01 May 2010 22:15:40 +0000</pubDate>
		<dc:creator>Aleksandar Bradic</dc:creator>
		
		<category><![CDATA[art]]></category>

		<category><![CDATA[processing]]></category>

		<guid isPermaLink="false">http://blog.voidsearch.com/?p=335</guid>
		<description><![CDATA[
]]></description>
			<content:encoded><![CDATA[<p><img src="http://blog.voidsearch.com/wp-content/uploads/2010/05/battery_time2.jpg" alt="battery_time2" title="battery_time2" width="800" height="323" class="alignnone size-full wp-image-334" /></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.voidsearch.com/art/battery-etc/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Sampling the Social Graph using Facebook Graph API</title>
		<link>http://blog.voidsearch.com/statistics/sampling-the-social-graph-using-facebook-graph-api/</link>
		<comments>http://blog.voidsearch.com/statistics/sampling-the-social-graph-using-facebook-graph-api/#comments</comments>
		<pubDate>Sun, 25 Apr 2010 21:16:25 +0000</pubDate>
		<dc:creator>Aleksandar Bradic</dc:creator>
		
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://blog.voidsearch.com/?p=174</guid>
		<description><![CDATA[Recently introduced Facebook Graph API represents an interesting source of data with a nice easy-to-appreciate context to it (everyone loves social). In order to motivate some of the examples in the blog, I have written up a simple quick&#38;dirty Graph API client in Java :
http://github.com/voidsearch/voidbase/tree/master/src/main/java/com/voidsearch/data/provider/facebook/
that provides trivial-to-use interface for graph data processing :

SimpleGraphAPIClient client = [...]]]></description>
			<content:encoded><![CDATA[<p>Recently introduced <a href="http://developers.facebook.com/docs/api">Facebook Graph API</a> represents an interesting source of data with a nice easy-to-appreciate context to it (everyone loves social). In order to motivate some of the examples in the blog, I have written up a simple quick&amp;dirty Graph API client in Java :</p>
<p><a href="http://github.com/voidsearch/voidbase/tree/master/src/main/java/com/voidsearch/data/provider/facebook/">http://github.com/voidsearch/voidbase/tree/master/src/main/java/com/voidsearch/data/provider/facebook/</a></p>
<p>that provides trivial-to-use interface for graph data processing :</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">SimpleGraphAPIClient client = new SimpleGraphAPIClient(fbToken);
LinkedList&lt;FacebookUser&gt; friends = client.getFriends();
for (FacebookUser friend : friends) {
        LinkedList&lt;LikedEntry&gt; likes = client.getLikes(friend.getID());
        LinkedList&lt;PhotoEntry&gt; photos = client.getPhotos(friend.getID());
        LinkedList&lt;GroupEntry&gt; groups = client.getGroups(friend.getID());
        // arbitrary dataset creation logic
}</pre></div></div>

<p>From a pure tool-perspective (without actually having an active fb app) - the dataset that can be generated is quite limited (bounded to the &#8220;neighborhood&#8221; of single user) - but even with that, a lot of interesting &#8220;play&#8221; data can be derived. For example, at minimum, we can get a <em>(num_likes, num_photos, num_groups)</em> data for all &#8220;friend&#8221; users and to that we can add some &#8220;derived&#8221; metrics like average group size, photo age, etc. Modeling this data alone can motivate some very interesting problems.</p>
<p>Here is a simple plot of <em>(num_likes, num_photos, num_groups) </em>dataset of 170 anonymous users obtained in this manner:</p>
<p><em><img class="alignnone size-full wp-image-216" title="sample_fb_dataset" src="http://blog.voidsearch.com/wp-content/uploads/2010/04/sample_fb_dataset.jpg" alt="sample_fb_dataset" width="580" height="455" /><br />
</em></p>
<p><em>(Note - some of the data that Graph API returns occasionally doesn&#8217;t match the actual state on the site - so some outliers might be just missing data on fb side. However, this (systematic bias) is what might make the dataset especially interesting </em> <img src='http://blog.voidsearch.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> <em>)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.voidsearch.com/statistics/sampling-the-social-graph-using-facebook-graph-api/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Nonparametric regression using R</title>
		<link>http://blog.voidsearch.com/finance/nonparametric-regression-using-r/</link>
		<comments>http://blog.voidsearch.com/finance/nonparametric-regression-using-r/#comments</comments>
		<pubDate>Sat, 24 Apr 2010 17:57:48 +0000</pubDate>
		<dc:creator>Aleksandar Bradic</dc:creator>
		
		<category><![CDATA[finance]]></category>

		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://blog.voidsearch.com/?p=140</guid>
		<description><![CDATA[
Nonparametric regression aims at modeling relation between predictors and dependent variable without any assumptions on specific form of the dependency function:

Unlike classical linear regression, where the goal is determining the parameters of an assumed linear function, with nonparametric regression the goal is estimating the entire regression function directly. Depending on the assumptions on the structure of [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignnone size-full wp-image-107" title="screen-shot-2010-04-24-at-11659-pm" src="http://blog.voidsearch.com/wp-content/uploads/2010/04/screen-shot-2010-04-24-at-11659-pm.png" alt="screen-shot-2010-04-24-at-11659-pm" width="540" height="82" /></p>
<p style="text-align: left;"><em>Nonparametric regression</em> aims at modeling relation between predictors and dependent variable without any assumptions on specific form of the dependency function:</p>
<p><img src='http://s.wordpress.com/latex.php?latex=E%28y_i%29%20%3D%20f%28x_%7B1i%7D..%2Cx_%7Bpi%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='E(y_i) = f(x_{1i}..,x_{pi})' title='E(y_i) = f(x_{1i}..,x_{pi})' class='latex' /></p>
<p style="text-align: left;">Unlike classical linear regression, where the goal is determining the parameters of an assumed linear function, with nonparametric regression the goal is estimating the entire regression function directly. Depending on the assumptions on the structure of the underlying data, a number of methods exist that achieve optimality of estimation. We give an overview of several methods and explain their practical usage in R. In doing so, we make use of the social graph data described in a recent <a href="http://blog.voidsearch.com/statistics/sampling-the-social-graph-using-facebook-graph-api/">post</a>.</p>
<p style="text-align: left;"><em>Local Regression</em></p>
<p style="text-align: left;">The LOWESS (<em>Locally Weighted Scatterplot Smoothing</em>) algorithm is based on the idea of <em>local</em> linear regression. The general approach of local regression is fitting simple models to &#8220;local&#8221; subsets of data and combining the results to determine the regression function for the entire dataset. In this method, for modelling &#8220;local&#8221; data we use a weighted least squares polynomial fit of the general form :</p>
<p><img src='http://s.wordpress.com/latex.php?latex=y_i%20%3D%20a%20%2B%20b_1%28x_i%20-%20x_0%29%20%2B%20b_2%28x_i-x_0%29%5E2%20%2B..%2B%20b_p%28x_i%20-%20x_0%29%5Ep%20%2B%20e_i%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y_i = a + b_1(x_i - x_0) + b_2(x_i-x_0)^2 +..+ b_p(x_i - x_0)^p + e_i  ' title='y_i = a + b_1(x_i - x_0) + b_2(x_i-x_0)^2 +..+ b_p(x_i - x_0)^p + e_i  ' class='latex' /></p>
<p>where <img src='http://s.wordpress.com/latex.php?latex=p%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p ' title='p ' class='latex' /> is the degree of the local polynomial and the &#8220;local&#8221; observations are weighted by their proximity to the &#8220;focal&#8221; value <img src='http://s.wordpress.com/latex.php?latex=x_0%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x_0 ' title='x_0 ' class='latex' />.</p>
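<p style="text-align: left;">To make the weighting concrete, here is a minimal sketch of a single local fit at one focal value. It is an illustration under stated assumptions, not the code behind lowess: the helper name local_fit, the tricube weight function, the default span of 0.5 and the synthetic data are all choices made for this example.</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;"># Synthetic data, for illustration only (not the social graph sample)
set.seed(1)
x &lt;- sort(runif(200, 0, 10))
y &lt;- sin(x) + rnorm(200, sd = 0.3)

# Hypothetical helper: weighted least squares polynomial fit around x0,
# using the span * n nearest observations with tricube weights
local_fit &lt;- function(x, y, x0, span = 0.5, degree = 1) {
  d &lt;- abs(x - x0)                          # distance to the focal value
  m &lt;- ceiling(span * length(x))            # number of "local" observations
  h &lt;- sort(d)[m]                           # distance to the m-th nearest one
  w &lt;- ifelse(d &lt;= h, (1 - (d / h)^3)^3, 0) # tricube weights, zero outside the window
  fit &lt;- lm(y ~ poly(x - x0, degree, raw = TRUE), weights = w)
  unname(coef(fit)[1])                      # intercept a = fitted value at x0
}

# Repeating the local fit over a grid of focal values gives the smooth curve
grid &lt;- seq(1, 9, by = 0.1)
smooth &lt;- sapply(grid, local_fit, x = x, y = y)
plot(x, y)
lines(grid, smooth, col = 2)</pre></div></div>

<p style="text-align: left;">Because the polynomial is centred at the focal value, the intercept of each local fit is the estimate of the regression function at that point. The sketch leaves out the robustness iterations that lowess controls through its iter argument.</p>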
<p style="text-align: left;"><img class="alignnone size-full wp-image-237" title="groups_likes_lowess" src="http://blog.voidsearch.com/wp-content/uploads/2010/04/groups_likes_lowess.jpg" alt="groups_likes_lowess" width="299" height="291" /></p>
<p style="text-align: left;">

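<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;">&gt; # scatterplot with number of groups on the x axis and number of likes on the y axis
&gt; plot(num_groups, num_likes)
&gt; # overlay the LOWESS curve, passing x and y in the same order as in plot()
&gt; lines(lowess(num_groups, num_likes, f = 2/3, iter = 4), col = 2)</pre></div></div>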

<p style="text-align: left;">The effect of the span window for f = 1/16, 1/8, 1/4 and 1/2:</p>
<p style="text-align: left;"><img class="alignnone size-full wp-image-242" title="lowess_span_effect" src="http://blog.voidsearch.com/wp-content/uploads/2010/04/lowess_span_effect.jpg" alt="lowess_span_effect" width="430" height="406" /></p>
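<p style="text-align: left;">A comparison of this kind can be produced roughly as follows. This is a sketch, not the code used for the figure: the data below are synthetic stand-ins for num_groups and num_likes, and the 2 x 2 panel layout is an assumption.</p>

<div class="wp_syntax"><div class="code"><pre class="language" style="font-family:monospace;"># Synthetic stand-ins for the social graph variables (illustration only)
set.seed(1)
num_groups &lt;- runif(300, 0, 100)
num_likes  &lt;- 5 + 2 * sqrt(num_groups) + rnorm(300, sd = 2)

spans &lt;- c(1/16, 1/8, 1/4, 1/2)
op &lt;- par(mfrow = c(2, 2))   # one panel per span value
for (f in spans) {
  plot(num_groups, num_likes, main = paste("f =", f))
  # f is the fraction of points that influence the fit at each x value
  lines(lowess(num_groups, num_likes, f = f, iter = 4), col = 2)
}
par(op)                      # restore the previous graphics settings</pre></div></div>

<p style="text-align: left;">Smaller values of f use fewer neighbouring points for each local fit, so the curve follows the data more closely; larger values give a smoother curve.</p>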
<p style="text-align: left;"><em>(in progress&#8230;)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.voidsearch.com/finance/nonparametric-regression-using-r/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Towards voidbase v0.1</title>
		<link>http://blog.voidsearch.com/voidbase/towards-voidbase-v01/</link>
		<comments>http://blog.voidsearch.com/voidbase/towards-voidbase-v01/#comments</comments>
		<pubDate>Wed, 27 Jan 2010 23:56:27 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[voidbase]]></category>

		<guid isPermaLink="false">http://blog.voidsearch.com/?p=67</guid>
		<description><![CDATA[As time moves on, we&#8217;re finally shaping up to move towards v0.1 of the framework. Some of the essential features of this release include:

voidbase console
configurable ui
ui save sessions
high-performance/concurrency stream processing
appropriate benchmarks
full documentation
high-level operations and fork queues (q2 &#60;- loglinear(q1) - style)
plugin mechanism for writing custom high-level operations
set of recipes for common use [...]]]></description>
			<content:encoded><![CDATA[<p>As time moves on, we&#8217;re finally shaping up to move towards v0.1 of the framework. Some of the essential features of this release include:</p>
<ul>
<li>voidbase console</li>
<li>configurable ui</li>
<li>ui save sessions</li>
<li>high-performance/concurrency stream processing</li>
<li>appropriate benchmarks</li>
<li>full documentation</li>
<li>high-level operations and fork queues (q2 &lt;- loglinear(q1) - style)</li>
<li>plugin mechanism for writing custom high-level operations</li>
<li>set of recipes for common use cases (web traffic monitoring, log processing, data stream aggregation..)</li>
<li>draft of C++-based native queue storage</li>
<li>draft of QuantLib integration</li>
<li>set of simple high-level machine learning operations</li>
<li>sample integration with Reuters data feed</li>
<li>draft of first version of parallelization platform</li>
<li>extension of ui capabilities</li>
<li>queue persistence / save sessions</li>
<li>transparent feed proxy support</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.voidsearch.com/voidbase/towards-voidbase-v01/feed/</wfw:commentRss>
		</item>
		<item>
		<title>A call for indie software development</title>
		<link>http://blog.voidsearch.com/software-design/a-call-for-indie-software-development/</link>
		<comments>http://blog.voidsearch.com/software-design/a-call-for-indie-software-development/#comments</comments>
		<pubDate>Wed, 04 Nov 2009 19:26:38 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[software design]]></category>

		<guid isPermaLink="false">http://blog.voidsearch.com/?p=65</guid>
		<description><![CDATA[Years of open source software development, aimed at perfecting the engineering craft and providing alternatives and resources for new developers to learn from, have been a successful mission, yet they have failed to produce the side effects that some of us hoped for. The hope was that open software would enable the many to express their creativity and start [...]]]></description>
			<content:encoded><![CDATA[<p>Years of open source software development, aimed at perfecting the engineering craft and providing alternatives and resources for new developers to learn from, have been a successful mission, yet they have failed to produce the side effects that some of us hoped for. The hope was that open software would enable the many to express their creativity and start creating new qualities of software, exploring new territories and moving the experience of building software from a craft to something more closely related to an art.</p>
<p>A similar process took place in music over the last century: once the tools for creating music became available to many, it became possible for many to create new qualities, regardless of their skill set, as long as they could formulate a long-lasting &#8220;art&#8221; aspect of it. A lot of indie music today is aimed at creating something &#8220;new&#8221; rather than something &#8220;perfect&#8221;. Even though a lot of it ends up in the dumpster, the long-term effect is an explosion of the field, in that more and more &#8220;new territories&#8221; are being explored and, every once in a while, a flag is raised. This makes all of our lives richer and helps us advance, which was the meaning of it all.</p>
<p>Naturally, a similar initiative is needed in software: a call for creating software aimed at something &#8220;new&#8221; rather than something &#8220;better&#8221; or &#8220;cheaper&#8221;, a call for creating something with art-like qualities, with a sense of individual expression and exploration.</p>
<p>&#8230; to be continued</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.voidsearch.com/software-design/a-call-for-indie-software-development/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>

