igraph and structured text exploration

I am slowly developing a package to bridge structured text formats (e.g., classroom transcripts) with the many great R packages that visualize and analyze quantitative data. If you care to play with a rough build of this package (qdap), see: https://github.com/trinker/qdap. One of the packages qdap will bridge to is igraph.

A while back I came across a blog post on igraph and word statistics (LINK). It inspired me to learn a little about graphing and the igraph package, and it provided a nice introduction. As I play with this terrific package, I feel it is my duty to share my experiences with others who are also just starting out with igraph. The following post is a script, along with the plots it creates, built from a word frequency matrix (similar to a term document matrix from the tm package) and igraph:

Build a word frequency matrix and convert to an adjacency matrix

X <- matrix(rpois(100, 1), 10, 10)
colnames(X) <- paste0("Guy_", 1:10)
rownames(X) <- c('The', 'quick', 'brown', 'fox', 'jumps',
    'over', 'a', 'bot', 'named', 'Dason')
X #word frequency matrix
Y <- (X >= 1) * 1  #boolean (0/1) matrix; row and column names carry over
Z <- t(Y) %*% Y  #adjacency matrix (people x people, counting shared words)
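A tiny hand-checkable sketch may help show why t(Y) %*% Y can serve as an adjacency matrix. The three speakers and four words below are made up for illustration and are not part of the data above. Off-diagonal cells count the distinct words two speakers share; diagonal cells count the distinct words one speaker used.

```r
# Toy word frequency matrix: rows are words, columns are speakers.
# Names and counts are hypothetical, chosen so the result is easy to check.
X_toy <- matrix(c(2, 0, 1, 0,   # Ann: "the", "brown"
                  1, 3, 0, 0,   # Bob: "the", "quick"
                  0, 1, 0, 4),  # Cy:  "quick", "fox"
                nrow = 4,
                dimnames = list(c("the", "quick", "brown", "fox"),
                                c("Ann", "Bob", "Cy")))

Y_toy <- (X_toy >= 1) * 1    # 1 if the speaker used the word at all
Z_toy <- t(Y_toy) %*% Y_toy  # speaker-by-speaker counts of shared words

Z_toy
```

Ann and Cy share no words, so their cell is 0 and they would get no edge in the graph.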

Build a graph from the above matrix

library(igraph)
#graph.adjacency() is named graph_from_adjacency_matrix() in newer igraph releases
g <- graph.adjacency(Z, weighted=TRUE, mode ='undirected')
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)

#Plot a Graph
layout1 <- layout.auto(g)
#for more on layout see:
opar <- par()$mar; par(mar=rep(0, 4)) #Give the graph lots of room
plot(g, layout=layout1)

Alter widths of edges based on similarity of people’s dialogue
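
The measure used here is 1 minus the "binary" distance from dist(), which works out to the share of words used by either person that both of them used (often called the Jaccard similarity). A minimal sketch with two made-up speakers, checking the value by hand:

```r
# Two hypothetical speakers over five words (counts are made up).
a <- c(2, 0, 1, 0, 3)
b <- c(1, 1, 0, 0, 2)

# dist(method = "binary") treats non-zero entries as "on" and ignores
# positions where both vectors are zero.
d <- dist(rbind(a, b), method = "binary")

shared <- sum(a > 0 & b > 0)  # words both speakers used
either <- sum(a > 0 | b > 0)  # words at least one speaker used

similarity <- 1 - as.numeric(d)
similarity       # proportion of overlap from dist()
shared / either  # the same proportion computed by hand
```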

#adjust the widths of the edges and add similarity labels
#use 1 - binary distance (see ?dist): of the words used by either person,
#the proportion used by both; 1 is perfect overlap and 0 is no overlap

edge.weight <- 7  #a maximizing thickness constant
z1 <- edge.weight*(1-dist(t(X), method="binary"))
E(g)$width <- c(z1)[c(z1) != 0] #remove 0s: these won't have an edge
z2 <- round(1-dist(t(X), method="binary"), 2)
E(g)$label <- c(z2)[c(z2) != 0]
plot(g, layout=layout1) #check it out! 
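
The assignment above relies on the values from dist() coming out in the same order as the edges in g. A more defensive sketch, assuming a reasonably recent igraph (1.0 or later, where graph_from_adjacency_matrix() and ends() are available), looks up each edge's similarity by its endpoint names instead. The seed and the word_ row names are placeholders of mine, not part of the original script.

```r
library(igraph)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
X <- matrix(rpois(100, 1), 10, 10,
            dimnames = list(paste0("word_", 1:10), paste0("Guy_", 1:10)))
Y <- (X >= 1) * 1
Z <- t(Y) %*% Y
g <- simplify(graph_from_adjacency_matrix(Z, weighted = TRUE,
                                          mode = "undirected"))

# Full person-by-person similarity matrix (1 - binary distance)
S <- 1 - as.matrix(dist(t(X), method = "binary"))

# ends() returns one row per edge holding the two endpoint names, so each
# edge fetches its own similarity regardless of how the edges are ordered
ep <- ends(g, E(g))
edge.sim <- S[ep]

E(g)$width <- 7 * edge.sim
E(g)$label <- round(edge.sim, 2)
```

Because only existing edges are looked up, there is no need to filter out zeros afterwards.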

Scale the label cex based on distinct word counts

SUMS <- diag(Z) #number of distinct words per person (same as colSums(Y))
label.size <- .5 #a maximizing label size constant
V(g)$label.cex <- (log(SUMS)/max(log(SUMS))) + label.size
plot(g, layout=layout1) #check it out!
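
To make the scaling concrete: log(SUMS)/max(log(SUMS)) runs from 0 to 1, so label sizes land between label.size and 1 + label.size. The sketch below uses made-up counts, and safe_cex() is a hypothetical helper (not from igraph or qdap) that guards against a count of 0, where log() would return -Inf.

```r
# Hypothetical distinct-word counts for four speakers
SUMS_toy <- c(Ann = 1, Bob = 4, Cy = 8, Dee = 8)
label.size <- 0.5

cex <- log(SUMS_toy) / max(log(SUMS_toy)) + label.size
cex
# A count of 1 gets label.size itself (log(1) is 0);
# the largest count gets 1 + label.size.

# One possible guard: clamp counts to at least 1 before taking logs.
safe_cex <- function(counts, base = 0.5) {
  counts <- pmax(counts, 1)
  top <- max(log(counts))
  if (top == 0) return(rep(base + 1, length(counts)))  # every count is <= 1
  log(counts) / top + base
}

safe_cex(c(0, 2, 8))
```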

Add vertex coloring based on a factor

#add factor information via vertex color
V(g)$gender <- rbinom(10, 1, .4)
V(g)$color <- ifelse(V(g)$gender==0, "pink", "lightblue")

plot(g, layout=layout1) #check it out!
plot(g, layout=layout1, edge.curved = TRUE) #curve it up

par(mar=opar) #reset margins 

Try it interactively with tkplot

#interactive version
tkplot(g)  #an interactive version of the graph
tkplot(g, edge.curved =TRUE) 

This just scratches the surface of igraph’s capabilities; the igraph documentation covers much more.

This post was me toying with different ideas and concepts. If you see a way to improve the code/thinking please leave a comment.

For a .txt version of this demonstration click here


About tylerrinker

Data Scientist, open-source developer, #rstats enthusiast, #dataviz geek, and #nlp buff

12 Responses to igraph and structured text exploration

  1. andrew clark says:

    Just getting into igraph myself and having some problems with the code. Presumably line 3 should not have a 0 after paste. More importantly, when I run the code, Z appears to be a 5×10 matrix with the bottom half of the rows. It is not a square matrix, so the subsequent graph.adjacency function (which presumably should come after attaching the igraph package) does not work.

  2. tylerrinker says:

    Hi Andrew,
    igraph is pretty awesome for all sorts of tasks. You have 2 concerns. 1) The 0 after paste is not a typo: paste0 is a function available as of R version 2.15 (I think), and it is the same as paste(…, sep = ""); see: http://stat.ethz.ch/R-manual/R-devel/library/base/html/paste.html. 2) Z being 5 x 10 is not what I get. On my machine in vanilla I get: https://dl.dropbox.com/u/61803503/output.txt

  3. andrew clark says:

    Thanks for getting back so swiftly. I was using R 2.14, so I updated, which got round the paste0 issue.
    Here is my output: https://dl-web.dropbox.com/get/myOutput.txt?w=4ffd94f8
    Hopefully you can see where I am going wrong.

    Thanks for the heads up on the other blog. I will check it out.

    • tylerrinker says:

      Andrew, I am unable to see your output. If you want to share things via dropbox (awesome feature IMHO) then you have to put the file you’re sharing in the directory marked Public.

  4. tylerrinker says:

    Andrew, try running a clean session. Something’s not operating correctly. I run exactly what you have uploaded and get a very different result. Somehow you’re losing half the matrix Y (the Boolean matrix). I suspect you’ve written over the function t or %*%, because even with a 5 row x 10 column matrix, t(Y[-c(1:5),]) %*% Y[-c(1:5),] is still a 10 x 10 matrix.

    • andrew clark says:

      I think it’s solved. When you mentioned function t, I recalled that way back, when I was experimenting with Rprofile, I had used t as a shortcut for the tail function. I rarely used it in practice and had never come across another function called t before. I deleted the code, and on rerunning everything seems to work fine. Thanks again for taking the time to help out.
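
For anyone who hits the same symptom, the exchange above can be reproduced in a few lines. This is only a sketch: the alias below is a guess at what such an .Rprofile shortcut might have looked like, and Y is a random stand-in for the Boolean matrix. Starting R with the --vanilla flag skips .Rprofile, which is one quick way to rule out this kind of masking.

```r
set.seed(42)  # arbitrary seed so the stand-in matrix is repeatable
Y <- matrix(rbinom(100, 1, 0.6), 10, 10)  # stand-in for the Boolean matrix

# Hypothetical shortcut of the kind described: t as an alias for tail()
t <- function(x) tail(x, 5)
masked_dim <- dim(t(Y) %*% Y)        # 5 x 10: "t" returned the last 5 rows
is_masked <- !identical(t, base::t)  # TRUE while the alias is defined

# Calling the namespaced function bypasses the mask
true_dim <- dim(base::t(Y) %*% Y)    # 10 x 10

rm(t)  # remove the alias so t() resolves to base::t again
```

Comparing a suspect name against its namespaced original with identical() is a cheap check before assuming the data are at fault.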
