I am slowly developing a package, qdap, to bridge structured text formats (e.g. classroom transcripts) with the many great R packages that visualize and analyze quantitative data. If you care to play with a rough build of the package, see: https://github.com/trinker/qdap. One of the packages qdap will bridge to is igraph.
A while back I came across a blog post on igraph and word statistics (LINK). It inspired me to learn a little about graphing and the igraph package, and it provided a nice introduction. As I play with this terrific package, I feel it is my duty to share my experiences with others who are just starting out with igraph as well. The following post is a script, and the plots it creates, built from a word frequency matrix (similar to a term document matrix from the tm package) and igraph:
Build a word frequency matrix and convert to an adjacency matrix
set.seed(10)
X <- matrix(rpois(100, 1), 10, 10)
colnames(X) <- paste0("Guy_", 1:10)
rownames(X) <- c('The', 'quick', 'brown', 'fox', 'jumps', 'over', 'a', 'bot', 'named', 'Dason')
X #word frequency matrix
Y <- X >= 1
Y <- apply(Y, 2, as, "numeric") #boolean matrix
rownames(Y) <- rownames(X)
Z <- t(Y) %*% Y #adjacency matrix
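To make the adjacency step concrete, here is a minimal sketch on a tiny made-up matrix (the names Y_toy and Z_toy and the data are hypothetical, not part of the script above). Each off-diagonal cell of the product counts the words two speakers share, and each diagonal cell counts the distinct words one speaker used.

```r
# Hypothetical toy incidence matrix: rows are words, columns are speakers
Y_toy <- matrix(c(1, 1, 0,
                  1, 0, 1,
                  0, 0, 1,
                  1, 1, 0),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("the", "quick", "brown", "fox"),
                                c("A", "B", "C")))

# Speaker-by-speaker matrix of shared-word counts
Z_toy <- t(Y_toy) %*% Y_toy
Z_toy

# Off-diagonal: A and B share "the" and "fox"; B and C share nothing
Z_toy["A", "B"]
Z_toy["B", "C"]

# Diagonal: number of distinct words per speaker
diag(Z_toy)

# crossprod() computes the same product in one call
all.equal(Z_toy, crossprod(Y_toy))
```

A zero off the diagonal, as for B and C here, is a pair that gets no edge once the matrix is turned into a graph.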
Build a graph from the above matrix
library(igraph) #load igraph before calling any of its functions
g <- graph.adjacency(Z, weighted=TRUE, mode ='undirected')
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
#Plot a Graph
set.seed(3952)
layout1 <- layout.auto(g)
#for more on layout see: browseURL("http://finzi.psych.upenn.edu/R/library/igraph/html/layout.html")
opar <- par()$mar; par(mar=rep(0, 4)) #Give the graph lots of room
plot(g, layout=layout1)
Alter widths of edges based on similarity of people’s dialogue
#adjust the widths of the edges and add distance measure labels
#use 1 - binary (?dist) a proportion distance of two vectors
#1 is perfect and 0 is no overlap (using 1 - binary)
edge.weight <- 7 #a maximizing thickness constant
z1 <- edge.weight*(1-dist(t(X), method="binary"))
E(g)$width <- c(z1)[c(z1) != 0] #remove 0s: these won't have an edge
z2 <- round(1-dist(t(X), method="binary"), 2)
E(g)$label <- c(z2)[c(z2) != 0]
plot(g, layout=layout1) #check it out!
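As a rough check on what 1 - binary measures, the sketch below recomputes it by hand on a small made-up count matrix (X_toy and the helper jaccard are hypothetical names introduced for illustration). The value is the number of words two speakers share divided by the number of words either of them used.

```r
# Hypothetical toy counts: rows are words, columns are speakers
X_toy <- matrix(c(2, 1, 0,
                  1, 0, 3,
                  0, 0, 1,
                  4, 2, 0),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("the", "quick", "brown", "fox"),
                                c("A", "B", "C")))

# Similarity as in the script: 1 minus the binary distance between columns
sim_dist <- as.numeric(1 - dist(t(X_toy), method = "binary"))

# The same quantity by hand: shared words / words used by either speaker
jaccard <- function(a, b) sum(a > 0 & b > 0) / sum(a > 0 | b > 0)
sim_hand <- c(jaccard(X_toy[, "A"], X_toy[, "B"]),
              jaccard(X_toy[, "A"], X_toy[, "C"]),
              jaccard(X_toy[, "B"], X_toy[, "C"]))

# dist() stores the pairs in the order (A,B), (A,C), (B,C)
all.equal(sim_dist, sim_hand)
```

Only presence or absence matters here; how often a word was used does not change the result.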
Scale the label cex based on word counts
SUMS <- diag(Z) #number of distinct words per person (same as colSums(Y))
label.size <- .5 #a maximizing label size constant
V(g)$label.cex <- (log(SUMS)/max(log(SUMS))) + label.size
plot(g, layout=layout1) #check it out!
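The scaling rule is easier to see on a few made-up counts (the vector counts below is hypothetical). The log ratio tops out at 1 for the largest count, so the largest label gets a cex of 1 + label.size and the rest fall below it.

```r
# Hypothetical distinct-word counts for three speakers
counts <- c(Guy_1 = 2, Guy_2 = 5, Guy_3 = 8)
label.size <- 0.5

# Same formula as in the script: log-scaled share of the maximum, plus a floor
cex <- log(counts) / max(log(counts)) + label.size
round(cex, 2)

# The largest count always maps to 1 + label.size
max(cex)
```

A count of 1 gives log(1) = 0 and so a cex of exactly label.size, while a count of 0 would give -Inf, so this rule assumes every speaker used at least one word.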
Add vertex coloring based on factoring
#add factor information via vertex color
set.seed(15)
V(g)$gender <- rbinom(10, 1, .4)
V(g)$color <- ifelse(V(g)$gender==0, "pink", "lightblue")
plot(g, layout=layout1) #check it out!
plot(g, layout=layout1, edge.curved = TRUE) #curve it up
par(mar=opar) #reset margins
Try it interactively with tkplot
#interactive version
tkplot(g) #an interactive version of the graph
tkplot(g, edge.curved =TRUE)
This is just scratching the surface of igraph’s capabilities. See the igraph documentation for more.
This post was me toying with different ideas and concepts. If you see a way to improve the code/thinking please leave a comment.
Hi,
Just getting into igraph myself and having some probs with the code. Presumably line 3 should not have a 0 after paste. More importantly, when I run the code, Z appears to be a 5×10 matrix with the bottom half of the rows. It is not a square matrix, so the subsequent graph.adjacency function (which presumably should come after attaching the igraph package) does not work.
Hi Andrew,
igraph is pretty awesome for all sorts of tasks. You have 2 concerns. 1) The 0 after paste is not a typo: paste0 is a function available as of R version 2.15 (I think) that is the same as paste(…, sep = ""); see: http://stat.ethz.ch/R-manual/R-devel/library/base/html/paste.html. 2) The concern about Z being 5 x 10 does not hold on my end. On my machine in a vanilla session I get: https://dl.dropbox.com/u/61803503/output.txt
Also take a look at this blog post: http://www.babelgraph.org/wp/?p=1 which has some pretty cool examples.
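A quick way to confirm the paste0 point in your own session, as a small sketch (the object names a and b are just for illustration):

```r
# paste0() concatenates with no separator
a <- paste0("Guy_", 1:3)

# The equivalent call for R versions without paste0()
b <- paste("Guy_", 1:3, sep = "")

a
identical(a, b)
```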
Tyler,
Thanks for getting back so swiftly. I was using R 2.14 so I updated, which got round the paste0 issue.
Here is my output https://dl-web.dropbox.com/get/myOutput.txt?w=4ffd94f8
Hopefully, you can see where i am going wrong
Thanks for the heads up on the other blog. I will check it out.
Andrew, I am unable to see your output. If you want to share things via dropbox (awesome feature IMHO) then you have to put the file you’re sharing in the directory marked Public.
Whoops. about a year since i used dropbox. try this https://dl-web.dropbox.com/get/Public/myOutput.txt?w=0291b8c4
Andrew, try going into your folder on your actual computer and right click. Select copy public link and post that here. Unfortunately, the link to the output is still restricted.
try this.
https://dl.dropbox.com/u/25945599/myOutput.txt
I can access this via browser
Andrew, try running a clean session. Something’s not operating correctly. I ran exactly what you uploaded and got a very different result. Somehow you’re losing half the matrix Y (the Boolean matrix). I suspect you’ve written over the function t or %*%, because even with a 5 row x 10 column matrix, t(Y[-c(1:5),]) %*% Y[-c(1:5),] is still a 10 x 10 matrix.
I think it’s solved. When you mentioned function t, I recalled that way back when I was experimenting with Rprofile I had used t as a shortcut for the tail function. I rarely used it in practice and had never come across another function called t before. I deleted the code and reran it, and it seems to work fine. Thanks again for taking the time to help out.
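For anyone who hits something similar, here is a hedged sketch of how a user-defined t can change the result, and how to look for such masking. The shortcut below is hypothetical (the actual Rprofile code was not posted), and it is defined inside local() so it does not touch your session.

```r
m <- matrix(1:20, nrow = 10)           # a 10 x 2 matrix

# Hypothetical shortcut of the kind described above
dims_masked <- local({
  t <- function(x) utils::tail(x, 5)   # masks base::t inside this block only
  dim(t(m))                            # returns the last 5 rows, no transpose
})

# A namespace-qualified call always reaches the real transpose
dims_base <- dim(base::t(m))

dims_masked
dims_base

# In a live session, these help spot objects that mask base functions
exists("t", envir = globalenv(), inherits = FALSE)
conflicts(detail = TRUE)
```

Starting R with the --vanilla flag skips the Rprofile, which is a quick way to tell whether startup code is behind a problem like this.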
My pleasure, when you create something with igraph be sure to post a blog and share.