Simple Twitter Stream Processing using Sed
Thursday, August 18th, 2011
With the broader availability of HTTP server-push APIs such as the Twitter Streaming API, creating data streams and piping them to regular Unix processes has become trivial, which opens up opportunities for playful interaction with the rest of the Unix toolbox. Tools such as sed and awk are particularly well suited to this task, and they have been part of the standard distribution since Unix Version 7 (1979). What is more, these tools were actually designed for stream-based processing (though with different “streams” in mind), so it is interesting to explore what good they can still do for us in 2011 and beyond.
We give some simple examples of operations on data streams using sed in combination with the Twitter Spritzer feed:
1. Get the text of all tweets in the stream:
curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*"text":\("[^"]*"\).*/\1/p'

"@isay_dayo chillin.. juss chillin"
"@BoeBoeThoe hahhaha followed by lil wayne -get too comfortable"
"On to the next ...."
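To try the expression without a live stream, it can be run against a canned line. This is only a sketch: the sample record is made up for illustration and far shorter than a real status object, and `extract_text` is a hypothetical helper name.

```shell
# Hypothetical, heavily trimmed status record (real ones carry many more fields).
sample='{"retweet_count":0,"text":"On to the next ....","created_at":"Thu Aug 18 10:00:00 +0000 2011"}'

# Same expression as in example 1: keep the quoted value that follows "text":.
extract_text() {
  sed -n 's/.*"text":\("[^"]*"\).*/\1/p'
}

printf '%s\n' "$sample" | extract_text
```

Because `[^"]*` stops at the first double quote it meets, a tweet whose text contains an escaped quote (`\"`) is cut off at that point.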
2. Get verbose print of all tweets in the stream:
curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*"retweet_count":\([^"]*\),.*"text":\("[^"]*"\).*"created_at":"\([^"]*\)".*"screen_name":"\([^"]*\)".*"time_zone":"\([^"]*\)".*/\3 | \4 (\5) | \2 (\1 retweets)/p'

Wed Jun 24 08:29:08 +0000 2009 | agirprlaplanete (Paris) | "Fukushima : contamination marine et silence du gouvernement http:\/\/t.co\/rcTwdXk" (4 retweets)
Thu Apr 16 18:06:15 +0000 2009 | rjmoeller (Central Time (US & Canada)) | "This is funny, I don't care where you're from: http:\/\/t.co\/6yo6ACj" (1 retweets)
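The sample output still carries JSON's escaped slashes (`http:\/\/t.co\/...`). One possible clean-up, sketched here with a hypothetical helper name `unescape_slashes`, is a second sed pass appended to the pipeline:

```shell
# JSON may write "/" as "\/"; turn every "\/" back into a plain "/".
# Using "|" as the s-command delimiter avoids escaping the slashes themselves.
unescape_slashes() {
  sed 's|\\/|/|g'
}

printf '%s\n' 'http:\/\/t.co\/rcTwdXk' | unescape_slashes
```

Other JSON escapes, such as `\u2026` or `\"`, are left untouched by this filter.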
3. Get US-only tweets with a retweet count of 5 or more (the pattern [5-9][0-9]* only checks the leading digit, so counts such as 12 or 150 are skipped):
curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*"retweet_count":\([5-9][0-9]*\),.*"text":\("[^"]*"\).*"created_at":"\([^"]*\)".*"screen_name":"\([^"]*\)".*"time_zone":"\([^"]*\) (US & Canada)[^"]*".*/\3 | \4 (\5) | \2 (\1 retweets)/p'

Thu Mar 15 09:26:10 +0000 2007 | sfslim (Pacific Time) | "Lesson learned? <90 people can paralyze a city transit system merely by leveraging the reputation of Anonymous. Fascinating\u2026 #PsyOps #OpBART" (75 retweets)
Wed Jun 22 01:09:25 +0000 2011 | _BlackStewie (Eastern Time) | "A Wise Hoodrat once said.. \" (66 retweets)
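A digit pattern such as `[5-9][0-9]*` tests only the first digit of the count. If a true numeric threshold is wanted, one option is to let sed emit the count as the first field and hand the comparison to awk. The sketch below assumes that approach; the helper names `count_and_text` and `more_than` and the three sample records are invented for illustration.

```shell
# Emit "retweet_count | text" for every status line that has both fields
# (like the original expressions, this assumes retweet_count precedes text).
count_and_text() {
  sed -n 's/.*"retweet_count":\([0-9][0-9]*\),.*"text":\("[^"]*"\).*/\1 | \2/p'
}

# Keep lines whose first "|"-separated field is numerically greater than $1.
more_than() {
  awk -F '|' -v min="$1" '$1 + 0 > min + 0'
}

printf '%s\n' \
  '{"retweet_count":5,"text":"exactly five"}' \
  '{"retweet_count":12,"text":"twelve"}' \
  '{"retweet_count":75,"text":"seventy-five"}' \
  | count_and_text | more_than 5
```

Only the records with counts 12 and 75 pass the filter; the record with a count of exactly 5 is dropped.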
4. Get all the http links appearing in tweets:
curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*"text":\("[^"]*http:\\[\/]*\([^"]*\)\\\/\([^" ]*\)"\).*/http:\/\/\2\/\3/p'

http://t.co/dvtPgcG
http://de.tk/0ijcS
http://t.co/M4mKFV8
http://t.co/KftSofI
5. Get all the hashtags in tweets:
curl 'http://stream.twitter.com/1/statuses/sample.json?delimited=length' -u USER:PASS \
| sed -n 's/.*"text":\("[^"]*\(#[^" ]*\)"\).*/\2/p'

#ojkbot
#worstfeeling
#raganswa
#AngerOnAuto
#Nostalgia
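The expression in example 5 anchors the hashtag directly before the closing quote, so it reports at most one tag per tweet, and only when the tweet ends with it. A possible variation, assuming a grep that supports the `-o` option (GNU and BSD grep do; it is not part of POSIX), is to isolate the text first and let grep print every match on its own line. The helper name `all_hashtags` and the sample record are hypothetical.

```shell
# Isolate the tweet text with sed, then print each hashtag on its own line.
# Assumption: tags consist of ASCII letters, digits and underscores only.
all_hashtags() {
  sed -n 's/.*"text":"\([^"]*\)".*/\1/p' | grep -o '#[A-Za-z0-9_][A-Za-z0-9_]*'
}

printf '%s\n' '{"text":"#OpBART update, more at #PsyOps tonight","id":1}' | all_hashtags
```

For the sample record this prints both tags, one per line, including the one in the middle of the text.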
(Note that these are just ad hoc ideas and have not been tested in great detail.)
Now, once we have created such sed-filtered streams, we can hook them up to the rest of the standard Unix tools, and that’s where the real fun begins …
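As one illustration of such a hook-up, a finite batch of extracted hashtags can be ranked by frequency with sort and uniq. This is a sketch under assumptions: `top_tags` is a hypothetical helper, the input is a made-up list of tags, and the final awk step is there only to normalise the padding that different uniq implementations put in front of the count.

```shell
# Rank the most frequent lines of the input; $1 is how many to show.
# sort groups identical tags, uniq -c counts them, sort -rn ranks by count.
top_tags() {
  sort | uniq -c | sort -rn | head -n "$1" | awk '{print $1, $2}'
}

printf '%s\n' '#Nostalgia' '#OpBART' '#Nostalgia' '#PsyOps' '#Nostalgia' '#OpBART' \
  | top_tags 2
```

Since sort has to read all of its input before it can print anything, a live stream would first need to be cut to a fixed number of lines, for example with `head -n 1000`, before being passed to `top_tags`.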

