Data mining on my facebook friends
This post is dedicated to my students, whom I have put a great deal of effort into pushing toward the fantastic R world… I’ve found an unlimited source of applications in crawling my facebook network. This post explains how to gather data from facebook and gives a few hints on how to analyze your friends’ mutual friendship network and main interests.
Collecting the data from facebook
To do so, you first need to:
- log in to your facebook account and get an access token on the facebook graph API (don’t forget to ask for extended permissions to be able to collect your friends’ data);
- use the function extract.FBNet provided at the end of this post to extract your friends’ mutual friendship network and main interests; this can be done with the following R command lines:
token = "AAACEdEose0c..." # paste your token here
res = extract.FBNet(token)
Then res$network is the network and res$info contains your friends’ likes, favorite music, movies and books. Further information about what can be collected using the facebook token is provided here (for instance, you can also collect your friends’ last posts and comments).
What kind of music do my friends listen to?
res$info is a list whose length is the number of my friends. Each element of the list contains:
- id: the facebook identifier;
- name: the friend’s name;
- music: the friend’s favorite musicians;
- movies: the friend’s favorite movies;
- books: the friend’s favorite books;
- likes: the friend’s “like” tags.
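To make this structure concrete, here is a minimal sketch with made-up data. The object toy.info and everything in it are hypothetical stand-ins for what res$info would hold, not actual output from facebook; empty fields are written as character(0) here, whereas the real extraction may leave them as NULL.

```r
# Hypothetical stand-in for res$info: one list element per friend
toy.info = list(
  list(id = "1001", name = "Friend A",
       music = c("The Cure", "Radiohead"), movies = "Brazil",
       books = character(0), likes = c("R", "igraph")),
  list(id = "1002", name = "Friend B",
       music = character(0), movies = character(0),
       books = "Dune", likes = "ggplot2")
)

# The list has one element per friend
n.friends = length(toy.info)

# Pull one field out of every element: names as a character vector...
friend.names = vapply(toy.info, function(x) x$name, character(1))
# ...and favorite music as a list of character vectors (possibly empty)
toy.music = lapply(toy.info, function(x) x$music)
```

The same lapply pattern works for the movies, books and likes fields.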
In the following, I study my friends’ favorite music (the last command is from the package ggplot2):
library(ggplot2)
# Collect the favorite musicians of each friend from info
all.music = lapply(res$info, function(x) x$music)
length(unique(unlist(all.music))) # 811 different values
sum(table(unlist(all.music))==1) # 83 musicians are only cited once
# Let's see which musicians are the most popular
music.freq = sort(table(unlist(all.music)), decreasing=TRUE)
best.music.freq = names(music.freq)[music.freq>4]
best.music = substr((unlist(all.music))[unlist(all.music)%in%best.music.freq],1,15) # to shorten the names
num = as.numeric(as.factor(best.music))
best.music = data.frame("music"=best.music,"id"=factor(num))
# labs() takes the axis label as x=, not xlab=
qplot(id, data=best.music, geom="bar", fill=music)+labs(title="My friends' favorite music", x="")
which gives the following chart
OK, so Vincent G., it seems that you spoiled these data… Also, it is very interesting to note that 8 of my friends have “Parce que nos plus belles conneries deviennent nos plus beaux souvenirs. =)” as one of their favorite musicians (I won’t translate it into English, but surely GIYF). Btw, I have the names… shame on you, guys!
At this point, I was very disappointed that The Cure were not among the most popular musicians: what kind of foolish friends do I have?! I finally checked it directly:
V(res$network)$name[unlist(lapply(all.music,function(x) length(grep("Cure", x)) != 0))]
and luckily found out that Matthieu V and Kevin M can come with me to the next gig.
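The same check works on any list of character vectors. Here is a minimal sketch on made-up data (toy.names and toy.music are hypothetical), written with grepl and vapply instead of grep and lapply, plus a case-insensitive variant:

```r
# Hypothetical data: friends' names and their favorite musicians
toy.names = c("Friend A", "Friend B", "Friend C")
toy.music = list(c("The Cure", "Radiohead"),
                 character(0),
                 c("Daft Punk", "the cure tribute band"))

# TRUE for each friend with at least one entry containing "Cure"
likes.cure = vapply(toy.music,
                    function(x) any(grepl("Cure", x, fixed = TRUE)),
                    logical(1))
toy.names[likes.cure]

# Case-insensitive variant, which also catches "the cure tribute band"
likes.cure.ci = vapply(toy.music,
                       function(x) any(grepl("cure", x, ignore.case = TRUE)),
                       logical(1))
toy.names[likes.cure.ci]
```

Friends with an empty music list are handled naturally, since any() of an empty logical vector is FALSE.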
Finally, let’s see who has the largest number of favorite musicians on her/his profile.
# Number of favorite musicians for each friend
music.addict = unlist(lapply(all.music, function(x) length(x)))
summary(music.addict)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   0.000   0.000   2.000   7.211   6.500  84.000
# Who has the largest number of favorite musicians on her/his profile?
head(sort(music.addict,decreasing=TRUE))
# [1] 84 81 79 71 55 45
V(res$network)[music.addict%in%c(84,81,79,71,55,45)]
# Vertex sequence:
# [1] "Matthieu V" "fabien P"   "Clement D"  "Paul C"
# [5] "Abou E"     "Alexia A"
One fourth of my friends have no favorite musician on their profile but… Alexia A, tell me, how can you have 84 favorite musicians? 😉
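Note that selecting vertices with %in%, as in the last command above, returns the names in vertex order rather than in decreasing order of counts. A small sketch on made-up data (toy.names and toy.music are hypothetical) that keeps each name paired with its count:

```r
# Hypothetical data: four friends and their favorite musicians
toy.names = c("Friend A", "Friend B", "Friend C", "Friend D")
toy.music = list(c("a", "b"), character(0), c("a", "b", "c", "d"), "e")

# Number of entries per friend; lengths() is the base R shortcut for
# unlist(lapply(x, length))
toy.addict = lengths(toy.music)

# Attach the names, then sort so that names and counts stay aligned
names(toy.addict) = toy.names
top = head(sort(toy.addict, decreasing = TRUE), 2)
top
```

Naming the vector before sorting is the design choice here: the ranking then carries the names along with it.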
The same analysis with books gives:
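# Number of favorite books for each friend (same construction as music.addict)
book.addict = unlist(lapply(res$info, function(x) length(x$books)))
summary(book.addict)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#  0.0000  0.0000  0.0000  0.7415  0.0000 22.0000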
More than half of my friends have no favorite book???!! (OK, neither do I, at least on facebook…)
Network analysis together with basic information on favorite music and books
First, have a quick glance at the network’s main characteristics (the package igraph is required to handle graphs):
summary(res$network)
# IGRAPH UN-- 147 547 --
# attr: id (v/c), name (v/c), initials (v/c)
which means that my network contains 147 friends (vertices) connected by 547 friendship links (edges), with vertex attributes id (the facebook id), name (their names) and initials (their initials). Other numerical summaries are provided by igraph, such as:
graph.density(res$network)
# Number of connections between friends divided by the number of possible connections
# [1] 0.05097381
transitivity(res$network)
# Probability that two friends who share a common relationship, except me, are also friends
# [1] 0.5661737
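Both figures can be recomputed by hand. The sketch below uses base R on a small made-up adjacency matrix (A is hypothetical, and this is not how igraph computes these quantities internally). It also checks the density reported above from the 147 vertices and 547 edges of the summary.

```r
# Toy undirected graph on 4 vertices: a triangle (1,2,3) plus the edge 3-4
A = matrix(0, 4, 4)
edges = rbind(c(1, 2), c(2, 3), c(1, 3), c(3, 4))
A[edges] = 1
A[edges[, 2:1]] = 1   # symmetrize

n = nrow(A)
m = sum(A) / 2                         # each edge is counted twice in A
toy.density = m / (n * (n - 1) / 2)    # edges / possible edges

# Global transitivity: 3 * triangles / connected triples.
# sum(diag(A %*% A %*% A)) counts each triangle 6 times,
# sum(k * (k - 1)) counts each connected triple twice.
k = rowSums(A)
toy.transitivity = sum(diag(A %*% A %*% A)) / sum(k * (k - 1))

# Density of the network above: 547 edges among 147 friends
density.fb = 547 / (147 * 146 / 2)
```

On the toy graph, 4 of the 6 possible edges exist and 1 triangle closes 3 of the 5 connected triples, so the two values are 4/6 and 0.6.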
Some of my friends are disconnected from what is called the largest connected component (i.e., the largest subgraph that is connected; so, yes, Myriam V and Célia E, you’re not in…)
# connected component analysis
connected.comp = clusters(res$network)
connected.comp$csize
#  [1] 102   3   3  13   1   2   1   1   1   2   1   1   2   1   1   1   2   2   2
# [20]   1   1   1   1   1
# The largest connected component contains 102 friends.
# Throwing away unconnected people (goodbye my dear sis'...)
# which.max picks the largest component (component 1 here)
lcc = induced.subgraph(res$network, connected.comp$membership==which.max(connected.comp$csize))
lcc
# IGRAPH UN-- 102 523 --
# + attr: id (v/c), name (v/c), initials (v/c)
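For readers curious about what the component analysis computes, here is a base R sketch of component labelling by repeated breadth-first search on a made-up adjacency matrix. The function name label.components is hypothetical and this is not igraph's implementation; it only mimics the membership and csize fields used above.

```r
label.components = function(A) {
  n = nrow(A)
  membership = rep(NA_integer_, n)
  current = 0L
  for (start in seq_len(n)) {
    if (!is.na(membership[start])) next   # already in a component
    current = current + 1L
    membership[start] = current
    queue = start
    while (length(queue) > 0) {
      v = queue[1]
      queue = queue[-1]
      # unlabelled neighbours of v join the current component
      nb = which(A[v, ] != 0 & is.na(membership))
      membership[nb] = current
      queue = c(queue, nb)
    }
  }
  list(membership = membership, csize = tabulate(membership))
}

# Toy graph: component {1,2,3}, component {4,5}, isolated vertex 6
A = matrix(0, 6, 6)
edges = rbind(c(1, 2), c(2, 3), c(4, 5))
A[edges] = 1
A[edges[, 2:1]] = 1   # symmetrize

comp = label.components(A)
# Vertices of the largest connected component
largest = which(comp$membership == which.max(comp$csize))
```

Selecting the component with which.max(csize) avoids assuming that the largest component is the first one found.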
Finally, I display the graph with the nodes colored according to the number of favorite musicians mentioned on each profile and labeled with my friends’ initials.
library(RColorBrewer) # provides brewer.pal
the.layout = layout.fruchterman.reingold(res$network)
the.colors = brewer.pal(9,"YlOrRd")
# color class of each friend from the number of favorite musicians, restricted to the friends in lcc
v.col = the.colors[1+cut(music.addict,c(-0.1,2,quantile(music.addict,probs=seq(0.6,1,length=7))),labels=F)][match(V(lcc)$name, V(res$network)$name)]
par(mar=c(0,0,0,0))
plot(lcc, layout=the.layout[match(V(lcc)$name, V(res$network)$name),],
     vertex.size=5, vertex.color=v.col, vertex.frame.color=v.col,
     vertex.label=V(lcc)$initials, vertex.label.cex=0.7,
     vertex.label.font=2, vertex.label.color="black")
which gives the two following charts (the red one is for music, the blue one for books… !)
Paul C., would you tell me what you’re doing apart from reading and listening to music?
Functions to collect your friends’ data from facebook
The packages RCurl, rjson and igraph are required to use these functions:
library(RCurl)
library(rjson)
library(igraph)

facebook = function( path = "me", access_token = token, options) {
  if( !missing(options) ){
    options = sprintf( "?%s&", paste( names(options), "=", unlist(options), collapse = "&", sep = "" ) )
  } else {
    # no extra option: the access token is the first query parameter
    options = "?"
  }
  data = getURL( sprintf( "https://graph.facebook.com/%s%saccess_token=%s", path, options, access_token ) )
  fromJSON( data )
}

extract.FBNet = function(token) {
  # outputs: igraph network ("network") and information on friends ("info" which is a list)
  # first, gather friends' list
  friends = facebook(path="me/friends", access_token=token)
  # basic friends' description
  friends.id = sapply(friends$data, function(x) x$id)
  # extract names
  friends.name = sapply(friends$data, function(x) iconv(x$name,"UTF-8","ASCII//TRANSLIT"))
  # short names to initials
  initials = function(x) {paste(substr(x,1,1), collapse="")}
  friends.initial = sapply(strsplit(friends.name," "), initials)
  # final data frame
  friends = data.frame("id"=friends.id, "name"=friends.name, "initial"=friends.initial, stringsAsFactors = FALSE)
  # Information on friends (the token is passed explicitly to every request)
  friends.info = list()
  for (ind in seq_along(friends.id)) {
    print(paste("information for friend number",ind,"..."))
    friends.info[[ind]] = list()
    friends.info[[ind]]$id = friends$id[ind]
    friends.info[[ind]]$name = friends$name[ind]
    tmp = facebook(path=paste(friends$id[ind],"/likes",sep=""), access_token=token)
    friends.info[[ind]]$likes = unique(unlist(lapply(tmp$data, function(x) x$name)))
    tmp = facebook(path=paste(friends$id[ind],"/books",sep=""), access_token=token)
    friends.info[[ind]]$books = unique(unlist(lapply(tmp$data, function(x) x$name)))
    tmp = facebook(path=paste(friends$id[ind],"/music",sep=""), access_token=token)
    friends.info[[ind]]$music = unique(unlist(lapply(tmp$data, function(x) x$name)))
    tmp = facebook(path=paste(friends$id[ind],"/movies",sep=""), access_token=token)
    friends.info[[ind]]$movies = unique(unlist(lapply(tmp$data, function(x) x$name)))
  }
  # friendship relation matrix
  N = length(friends.id)
  friendship.matrix = matrix(0,N,N)
  for (i in seq_len(N)) {
    # For each friend, find the mutual friends to add edges to the graph
    tmp = facebook(path=paste("me/mutualfriends", friends.id[i], sep="/") , access_token=token)
    mutualfriends = sapply(tmp$data, function(x) x$id)
    friendship.matrix[i,friends.id %in% mutualfriends] = 1
  }
  colnames(friendship.matrix) = friends.id
  rownames(friendship.matrix) = friends.name
  mygraph = graph.adjacency(friendship.matrix,mode="undirected",add.colnames="id",add.rownames="name")
  V(mygraph)$initials = friends$initial
  list("network"=mygraph, "info"=friends.info)
}
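The construction of the friendship matrix can be tried without any facebook access. In the sketch below, toy.id, toy.name and toy.mutual are made-up stand-ins for the friends' list and for the answers of the me/mutualfriends requests; only base R is used.

```r
# Hypothetical friends: ids and names
toy.id = c("1001", "1002", "1003", "1004")
toy.name = c("Anna Bell", "Carl D Evans", "Fay Green", "Hugo Ito")
# For each friend, the ids of the friends we have in common
toy.mutual = list(c("1002", "1003"), c("1001", "1003"),
                  c("1001", "1002"), character(0))

# Same initials helper as in extract.FBNet
initials = function(x) paste(substr(x, 1, 1), collapse = "")
toy.initials = sapply(strsplit(toy.name, " "), initials)

# Row i gets a 1 in the columns of the friends shared with friend i:
# a mutual friend of mine and of friend i is a friend of friend i
N = length(toy.id)
toy.matrix = matrix(0, N, N, dimnames = list(toy.name, toy.id))
for (i in seq_len(N)) {
  toy.matrix[i, toy.id %in% toy.mutual[[i]]] = 1
}
toy.matrix
```

A symmetric 0/1 matrix like this one is what extract.FBNet hands to graph.adjacency; the last friend has no mutual friend and would show up as an isolated vertex.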