Archive for April, 2010

Sampling the Social Graph using Facebook Graph API

Sunday, April 25th, 2010

The recently introduced Facebook Graph API is an interesting source of data with a nice, easy-to-appreciate context (everyone loves social). To motivate some of the examples on this blog, I have written a simple quick & dirty Graph API client in Java:

http://github.com/voidsearch/voidbase/tree/master/src/main/java/com/voidsearch/data/provider/facebook/

that provides a trivial-to-use interface for graph data processing:

import java.util.LinkedList;

// fbToken: a valid Graph API access token, obtained separately
SimpleGraphAPIClient client = new SimpleGraphAPIClient(fbToken);
LinkedList<FacebookUser> friends = client.getFriends();
for (FacebookUser friend : friends) {
    // per-friend data exposed by the client
    LinkedList<LikedEntry> likes = client.getLikes(friend.getID());
    LinkedList<PhotoEntry> photos = client.getPhotos(friend.getID());
    LinkedList<GroupEntry> groups = client.getGroups(friend.getID());
    // arbitrary dataset creation logic
}

From a pure tool perspective (without actually having an active fb app), the dataset that can be generated is quite limited (bounded to the “neighborhood” of a single user), but even so, a lot of interesting “play” data can be derived. For example, at minimum, we can get (num_likes, num_photos, num_groups) data for all “friend” users, and to that we can add some “derived” metrics like average group size, photo age, etc. Modeling this data alone can motivate some very interesting problems.
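As a rough sketch of the "dataset creation logic", here is one way the per-friend counts could be collected and written out as CSV for later use in R. The class FriendFeatures, its Row type and the toCsv method are hypothetical names used only for illustration (they are not part of the voidbase client), and the numbers in main are made up:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the voidbase client.
class FriendFeatures {

    // One row of the dataset: counts for a single (anonymized) friend.
    static final class Row {
        final int numLikes;
        final int numPhotos;
        final int numGroups;

        Row(int numLikes, int numPhotos, int numGroups) {
            this.numLikes = numLikes;
            this.numPhotos = numPhotos;
            this.numGroups = numGroups;
        }
    }

    // Renders the rows as CSV with a header line, one friend per line.
    static String toCsv(List<Row> rows) {
        StringBuilder sb = new StringBuilder("num_likes,num_photos,num_groups\n");
        for (Row r : rows) {
            sb.append(r.numLikes).append(',')
              .append(r.numPhotos).append(',')
              .append(r.numGroups).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        // Inside the friends loop, a row would presumably be built from the
        // list sizes, e.g. new Row(likes.size(), photos.size(), groups.size()).
        rows.add(new Row(12, 40, 3)); // made-up values, for illustration only
        rows.add(new Row(0, 7, 1));
        System.out.print(toCsv(rows));
    }
}
```

A file written this way can be loaded in R with read.csv, which gives the num_likes, num_photos and num_groups columns directly.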

Here is a simple plot of the (num_likes, num_photos, num_groups) dataset of 170 anonymous users obtained in this manner:

[Figure: sample_fb_dataset]

(Note - some of the data that the Graph API returns occasionally doesn’t match the actual state on the site, so some outliers might just be missing data on the fb side. However, this systematic bias is what might make the dataset especially interesting :) )

Nonparametric regression using R

Saturday, April 24th, 2010


Nonparametric regression aims at modeling the relation between predictors and a dependent variable without any assumptions about the specific form of the dependency function:

E(y_i) = f(x_{1i}, ..., x_{pi})

Unlike classical linear regression, where the goal is to determine the parameters of an assumed linear function, the goal of nonparametric regression is to estimate the entire regression function directly. Depending on the assumptions about the structure of the underlying data, a number of methods exist that achieve optimal estimation. We give an overview of several methods and explain their practical usage in R. In doing so, we make use of the social graph data described in a recent post.

Local Regression

The LOWESS (Locally Weighted Scatterplot Smoothing) algorithm is based on the idea of local linear regression. The general approach of local regression is to fit simple models to “local” subsets of the data and combine the results to determine the regression function for the entire dataset. In this method, to model the “local” data we use a weighted least squares polynomial fit of the general form:

y_i = a + b_1(x_i - x_0) + b_2(x_i - x_0)^2 + ... + b_p(x_i - x_0)^p + e_i

where p is the degree of the local polynomial and the “local” observations are weighted by their proximity to the “focal” value x_0.
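To make the local fit concrete, here is a minimal Java sketch of the fitted value at a single focal point x_0. It is a simplification under stated assumptions, not R's lowess implementation: it uses a local line (p = 1), tricube weights (the usual choice for LOWESS), a neighborhood of the ceil(f * n) nearest observations, and no robustness iterations. The class and method names (LocalLinearSketch, fitAt) are hypothetical:

```java
import java.util.Arrays;

// Simplified local linear fit; hypothetical names, no robustness iterations.
class LocalLinearSketch {

    // Tricube kernel: (1 - |u|^3)^3 for |u| < 1, and 0 otherwise.
    static double tricube(double u) {
        double a = Math.abs(u);
        if (a >= 1.0) {
            return 0.0;
        }
        double t = 1.0 - a * a * a;
        return t * t * t;
    }

    // Fitted value at x0 from a weighted least squares line (p = 1)
    // through the ceil(f * n) observations nearest to x0.
    static double fitAt(double x0, double[] x, double[] y, double f) {
        int n = x.length;
        if (n == 0 || y.length != n) {
            throw new IllegalArgumentException("x and y must be non-empty and of equal length");
        }
        int k = Math.min(n, Math.max(2, (int) Math.ceil(f * n)));

        // Bandwidth h: distance from x0 to its k-th nearest observation.
        double[] dist = new double[n];
        for (int i = 0; i < n; i++) {
            dist[i] = Math.abs(x[i] - x0);
        }
        double[] sorted = dist.clone();
        Arrays.sort(sorted);
        double h = sorted[k - 1];

        // Weighted sums in coordinates centered at x0, d = x_i - x0.
        double sw = 0, swd = 0, swdd = 0, swy = 0, swdy = 0;
        for (int i = 0; i < n; i++) {
            double u = (h > 0.0) ? dist[i] / h : (dist[i] == 0.0 ? 0.0 : 1.0);
            double w = tricube(u);
            double d = x[i] - x0;
            sw += w;
            swd += w * d;
            swdd += w * d * d;
            swy += w * y[i];
            swdy += w * d * y[i];
        }

        if (sw == 0.0) {
            // All neighbors sit exactly at distance h: fall back to their plain mean.
            double sum = 0;
            int count = 0;
            for (int i = 0; i < n; i++) {
                if (dist[i] <= h) {
                    sum += y[i];
                    count++;
                }
            }
            return sum / count;
        }

        // Solve the 2x2 normal equations; the intercept a is the fit at x0.
        double det = sw * swdd - swd * swd;
        if (Math.abs(det) < 1e-12) {
            return swy / sw; // degenerate neighborhood: weighted mean
        }
        return (swdd * swy - swd * swdy) / det;
    }

    public static void main(String[] args) {
        // Made-up data, for illustration only.
        double[] x = {1, 2, 3, 4, 5, 6, 7, 8};
        double[] y = {2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1};
        System.out.println(fitAt(4.5, x, y, 0.5));
    }
}
```

Evaluating fitAt over a grid of focal values traces out the smoothed curve. Because the model is centered at x_0, the intercept a is itself the fitted value, which is why only a is returned.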

[Figure: groups_likes_lowess]

> plot(num_groups, num_likes)
> lines(lowess(num_groups, num_likes, f = 2/3, iter = 4), col = 2)

The effect of the span window (f = 1/16, f = 1/8, f = 1/4, f = 1/2):

[Figure: lowess_span_effect]

(in progress…)