Archive for August, 2011

Simple Twitter Stream Processing using Sed

Thursday, August 18th, 2011

With the broader availability of HTTP server-push APIs such as the Twitter Streaming API, creating data streams and piping them to regular Unix processes has become trivial, which opens up opportunities for playful interactions with the rest of the Unix toolbox. Tools such as sed and awk are particularly well suited for this task, and they have been part of the standard distribution since Unix Version 7 (1979). Moreover, these tools were actually designed for stream-based processing (though with different “streams” in mind), so it is interesting to explore what good they can still do for us in 2011+ :)

We give some simple examples of operations on data streams using sed in combination with the Twitter Spritzer Feed :

1. Get text of all tweets in the stream :

curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*"text":\("[^"]*"\).*/\1/p'
"@isay_dayo chillin.. juss chillin"
"@BoeBoeThoe hahhaha followed by lil wayne -get too comfortable"
"On to the next ...."

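The expression can also be tried offline by feeding it a canned status line instead of the live stream. A minimal sketch, assuming a made-up, heavily trimmed JSON line (real statuses carry many more fields) and a hypothetical helper name `tweet_text`:

```shell
# tweet_text: hypothetical helper wrapping the sed expression from example 1.
# Reads JSON status lines on stdin and prints the quoted "text" value of each;
# lines without a "text" field print nothing because of -n.
tweet_text() {
  sed -n 's/.*"text":\("[^"]*"\).*/\1/p'
}

# Made-up sample line, trimmed to two fields for illustration only.
sample='{"text":"On to the next ....","retweet_count":0}'

printf '%s\n' "$sample" | tweet_text
```

Because `[^"]*` stops at the first double quote it meets, a tweet whose text contains an escaped quote (`\"`) is cut short at that point.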
2. Get verbose print of all tweets in the stream:

curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*"retweet_count":\([^"]*\),.*"text":\("[^"]*"\).*"created_at":"\([^"]*\)".*"screen_name":"\([^"]*\)".*"time_zone":"\([^"]*\)".*/\3 | \4 (\5) | \2 (\1 retweets)/p'
Wed Jun 24 08:29:08 +0000 2009 | agirprlaplanete (Paris) | "Fukushima : contamination marine et silence du gouvernement http:\/\/t.co\/rcTwdXk" (4 retweets)
Thu Apr 16 18:06:15 +0000 2009 | rjmoeller (Central Time (US & Canada)) | "This is funny, I don't care where you're from: http:\/\/t.co\/6yo6ACj" (1 retweets)

3. Get US-only tweets with retweet count > 5 (the \| alternation used here is a GNU sed extension) :

curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*"retweet_count":\([6-9]\|[1-9][0-9][0-9]*\),.*"text":\("[^"]*"\).*"created_at":"\([^"]*\)".*"screen_name":"\([^"]*\)".*"time_zone":"\([^"]*\) (US & Canada)[^"]*".*/\3 | \4 (\5) | \2 (\1 retweets)/p'
Thu Mar 15 09:26:10 +0000 2007 | sfslim (Pacific Time) | "Lesson learned? <90 people can paralyze a city transit system merely by leveraging the reputation of Anonymous. Fascinating\u2026 #PsyOps #OpBART" (75 retweets)
Wed Jun 22 01:09:25 +0000 2011 | _BlackStewie (Eastern Time) | "A Wise Hoodrat once said.. \" (66 retweets)
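
Comparing numbers with a regular expression is awkward, so an alternative is to let sed only extract the fields and leave the comparison to awk. A minimal sketch of that split, assuming made-up trimmed status lines in which "retweet_count" precedes "text", a hypothetical helper name `popular`, and a tab as the intermediate field separator:

```shell
# popular MIN: hypothetical helper. sed rewrites each status line as
# "<retweet_count><TAB><text>"; awk then keeps only the lines whose
# count is numerically greater than MIN.
popular() {
  min=$1
  tab=$(printf '\t')
  sed -n 's/.*"retweet_count":\([0-9][0-9]*\),.*"text":\("[^"]*"\).*/\1'"$tab"'\2/p' \
  | awk -F'\t' -v min="$min" '$1 + 0 > min + 0 { print $2 " (" $1 " retweets)" }'
}

# Made-up sample lines for illustration only.
printf '%s\n' \
  '{"retweet_count":75,"text":"first"}' \
  '{"retweet_count":5,"text":"second"}' \
  '{"retweet_count":12,"text":"third"}' \
| popular 5
```

With the three sample lines above, only the tweets with 75 and 12 retweets pass the filter; the one with exactly 5 is dropped.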

4. Get all the http links appearing in tweets :

curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*"text":\("[^"]*http:\\[\/]*\([^"]*\)\\\/\([^" ]*\)"\).*/http:\/\/\2\/\3/p'
http://t.co/dvtPgcG
http://de.tk/0ijcS
http://t.co/M4mKFV8
http://t.co/KftSofI

5. Get all the hashtags in tweets :

curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*"text":\("[^"]*\(#[^" ]*\)"\).*/\2/p'
#ojkbot
#worstfeeling
#raganswa
#AngerOnAuto
#Nostalgia
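
The expression in example 5 only captures a hashtag that ends the tweet text. One way to list every hashtag is to extract the text first and then split it into words. A minimal sketch, assuming a made-up sample line and a hypothetical helper name `hashtags`; it treats any space-separated word starting with `#` as a hashtag, which is cruder than Twitter's own rules:

```shell
# hashtags: hypothetical helper. Step 1 pulls out the tweet text,
# step 2 puts each space-separated word on its own line,
# step 3 keeps words starting with '#' and trims trailing punctuation.
hashtags() {
  sed -n 's/.*"text":"\([^"]*\)".*/\1/p' \
  | tr -s ' ' '\n' \
  | sed -n 's/^\(#[A-Za-z0-9_][A-Za-z0-9_]*\).*/\1/p'
}

# Made-up sample line for illustration only.
printf '%s\n' '{"text":"#OpBART is trending, see #PsyOps!"}' | hashtags
```

For the sample line this prints `#OpBART` and `#PsyOps`, one per line.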

(Note that these are just ad hoc ideas and have not been tested in great detail.)

Now, once such sed-filtered streams are in place, we can hook them up to the rest of the standard Unix tools, and that’s where the real fun begins … :)
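
As one example of such a hook-up, a finite slice of a filtered stream can be passed to `sort` and `uniq` to rank its items. A minimal sketch, assuming one hashtag per input line (as printed by example 5) and a hypothetical helper name `top_tags`; `head` bounds the input because `sort` cannot print anything until its input ends:

```shell
# top_tags N: hypothetical helper. Reads one hashtag per line, looks at
# the first N lines only, and prints "count tag" pairs, most frequent first
# (ties are ordered by tag).
top_tags() {
  head -n "$1" \
  | sort \
  | uniq -c \
  | sort -k1,1nr -k2 \
  | awk '{ print $1, $2 }'
}

# Made-up input for illustration; the seventh line falls outside the slice.
printf '%s\n' '#a' '#b' '#a' '#c' '#a' '#b' '#ignored' | top_tags 6
```

On a live stream the same helper would sit at the end of the pipeline from example 5, for instance as `top_tags 1000`, and would finish once that many hashtags have arrived.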

news::visualized | take #02

Monday, August 15th, 2011

A bit more of the Sprawl Voice concept design :

newswire_02_resized

news::visualized | take #01

Saturday, August 13th, 2011

A piece of concept design for something we have been playing with as a part of our (upcoming) Sprawl Voice project :

newswire_infographics1_resized2

stay tuned …