Part III: Using TidyText to perform sentiment analysis

Let’s continue with two of our authors from the previous section: Herodotus and Livy. Now we will create a ‘words’ data frame by following the standard tidytext process: reading in a text file, recording the line number of each row, and tokenizing the text so that each row holds one word.

livy <- livy %>% unnest_tokens(word,text)

write.table(livy, file = "livy.txt")

# read the file back in, one row per line (tibble() replaces the deprecated data_frame())
livy_words <- tibble(file = "livy.txt") %>%
  mutate(text = map(file, read_lines)) %>%
  unnest(text) %>%
  group_by(file = str_sub(basename(file), 1, -5)) %>%
  mutate(line_number = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)
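
To see what the tokenizing step produces, here is a minimal sketch on two invented lines rather than on the Livy file. The object names toy_text and toy_words are hypothetical, and the sketch assumes the dplyr and tidytext packages are installed.

```r
library(dplyr)
library(tidytext)

# Two invented lines standing in for a text file read with read_lines()
toy_text <- tibble(
  file = "toy",
  text = c("The consul praised the brave soldiers.",
           "The enemy burned the city.")
)

toy_words <- toy_text %>%
  mutate(line_number = row_number()) %>%  # remember which line each word came from
  unnest_tokens(word, text)               # one lowercase word per row, punctuation dropped

toy_words
```

Each row of toy_words keeps its file name and line number, which is what allows the later steps to locate a word within the volume.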

Next, call the ‘inner_join’ function, which combines two data sets and keeps only the rows that match in both. Here we are joining the words from Livy with a dictionary of sentiment words (the Bing lexicon) that labels each word as either positive or negative.

livy_words_sentiment <- inner_join(livy_words,
                              get_sentiments("bing")) %>%
  count(file, index = round(line_number/ max(line_number) * 100 / 5) * 5, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative)
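
The join and the index calculation in the block above can be followed on a toy example. This is only a sketch: toy_words and toy_lexicon are invented, and the four-line "text" and five-word lexicon are assumptions made so that the arithmetic is easy to check by hand. It assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Invented data: eight words spread over four lines of a hypothetical text
toy_words <- tibble(
  file = "toy",
  line_number = c(1, 1, 2, 2, 3, 3, 4, 4),
  word = c("victory", "city", "death", "fear", "river", "glory", "war", "peace")
)

# Hypothetical lexicon in the same shape as get_sentiments("bing")
toy_lexicon <- tibble(
  word = c("victory", "glory", "peace", "death", "fear"),
  sentiment = c("positive", "positive", "positive", "negative", "negative")
)

# inner_join() keeps only the rows whose word occurs in both tables
toy_matched <- inner_join(toy_words, toy_lexicon, by = "word")

toy_sentiment <- toy_matched %>%
  # express each line's position as a percentage of the text, rounded to the nearest 5
  count(file, index = round(line_number / max(line_number) * 100 / 5) * 5, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative)

toy_sentiment
```

Words that are not in the lexicon ("city", "river", "war" here) simply drop out of the join, so the net sentiment only reflects the words the lexicon knows about. Note also that max(line_number) is taken from the joined data, that is, from the last line that contains a matched word.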

Using the ggplot2 library, we can visualise the results.

livy_words_sentiment %>% ggplot(aes(x = index, y = net_sentiment, fill = file)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  facet_wrap(~ file) + 
  scale_x_continuous("Location in the volume") + 
  scale_y_continuous("Bing net Sentiment")

Let’s make this interesting: let’s compare these results to Herodotus.

herodotus <- herodotus %>% unnest_tokens(word,text)

write.table(herodotus, file = "herodotus.txt")

# read the file back in, one row per line (tibble() replaces the deprecated data_frame())
herodotus_words <- tibble(file = "herodotus.txt") %>%
  mutate(text = map(file, read_lines)) %>%
  unnest(text) %>%
  group_by(file = str_sub(basename(file), 1, -5)) %>%
  mutate(line_number = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)

herodotus_words_sentiment <- inner_join(herodotus_words,
                                                 get_sentiments("bing")) %>%
  count(file, index = round(line_number/ max(line_number) * 100 / 5) * 5, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative)

herodotus_words_sentiment %>% ggplot(aes(x = index, y = net_sentiment, fill = file)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  facet_wrap(~ file) + 
  scale_x_continuous("Location in the volume (by percentage)") + 
  scale_y_continuous("Bing net sentiment of Herodotus's Histories...")

That’s quite a difference. On this measure, the Roman history appears to lean more heavily on negative words than the Greek one. Let’s break down the Livy results into more understandable graphs.

bing_word_counts <- livy_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(20) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Word Frequency of Sentiment Words in Livy",
       x = NULL) +
  coord_flip()
summary(bing_word_counts)

# repeat the same counts and plot for Herodotus
bing_word_counts <- herodotus_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(20) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Word Frequency of Sentiment Words in Herodotus",
       x = NULL) +
  coord_flip()
summary(bing_word_counts)

Another way to re-orient the sentiment results is to create a word cloud. Sometimes these can be useful for assessing the total weight of positivity or negativity in a corpus.

library(wordcloud)
library(reshape2)

# create a sentiment wordcloud of the Livy results

livy_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(max.words = 1000, scale = c(1,.25), 
                   random.order = FALSE,
                   colors = c("red", "blue"))
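
Before the counts reach comparison.cloud, acast reshapes them into a matrix. The following sketch shows that intermediate step on invented counts; toy_counts is hypothetical and the numbers are made up. It assumes the reshape2 package is installed.

```r
library(reshape2)

# Invented counts in the shape produced by count(word, sentiment)
toy_counts <- data.frame(
  word = c("death", "war", "glory", "peace"),
  sentiment = c("negative", "negative", "positive", "positive"),
  n = c(12, 9, 7, 4)
)

# Rows are words, columns are sentiments; fill = 0 marks a word absent from a column
toy_matrix <- acast(toy_counts, word ~ sentiment, value.var = "n", fill = 0)
toy_matrix
```

comparison.cloud treats each column of this matrix as one group, so with the columns in the order negative, positive the colours c("red", "blue") should correspond to negative and positive words respectively.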

As you can see, the cloud displays the overall negativity that the bar chart above suggested. Let’s see how that compares to Herodotus.

library(wordcloud)
library(reshape2)

# create a sentiment wordcloud of the Herodotus results

herodotus_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(max.words = 1000, scale = c(1,.25), 
                   random.order = FALSE,
                   colors = c("red", "blue"))

Exercise 3

Load your own texts (either from your own corpus or from a digital repository like Perseus or Project Gutenberg).

Posit a new question (or questions) that you would like to investigate further.

Modify one or more code blocks from Part I of the R Notebook to answer your question.

Part IV: Using the XML Library to analyse editions

We can complement the results in the previous section. Below I will walk you through the analyses that guided the examination of XML files of Melville’s marginalia in his 7-volume set of Shakespeare’s plays.

First, we need the XML package in R. In the bottom-right pane of RStudio, check the Packages tab to see whether you have the XML package. If not, click “Install”, then search for and download the package. Now we can load the package and run our first parsing loop over the XML.

library(XML)
doc <- xmlTreeParse("460-markings-only.xml", useInternalNodes=TRUE)
divs.ns.l <- getNodeSet(doc, "/body//div") # iterates over ('reads through') all divs in the XML file
divs.ns.l
div.freqs.l <- list() # create two list variables into which you will dump the XML data
div.raws.l <- list()

# the for loop grabs all the divs and calculates the relative frequencies among words 
for(i in 1:length(divs.ns.l)){
  div.content <- xmlValue(divs.ns.l[[i]], "div")[[1]]
  words.ns <- xmlElementsByTagName(divs.ns.l[[i]], "w", recursive = TRUE)
  div.words.v <- paste(sapply(words.ns, xmlValue), collapse=" ")
  words.lower.v <- tolower(div.words.v)
  words.l <- strsplit(words.lower.v, "[^a-z'-]")
  word.v <- unlist(words.l)
  word.v <- word.v[which(word.v!="")]
  div.freqs.t <- table(word.v)
  div.raws.l[[div.content]] <- div.freqs.t
  div.freqs.l[[div.content]] <- 100*(div.freqs.t/sum(div.freqs.t))
}
div.freqs.un.v <- unlist(div.freqs.l)
sorted.div.freqs.v <- sort(div.freqs.un.v, decreasing=T)

# just for your records, this creates a spreadsheet of all your results
write.table(sorted.div.freqs.v, "460.freqs.v.sorted.txt")

# show how many divs are in the set
length(div.raws.l)

If you would like to see the results, you could run the div.freqs.l variable.

div.freqs.l
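
To make the body of the for loop easier to follow, here is a minimal sketch of the same steps applied to a single invented string rather than to a div from the XML file. The names beginning with toy_ are hypothetical; only base R is used.

```r
# An invented marking, standing in for the text pasted together from the <w> elements
toy_marking <- "To be, or not to be: that is the question"

toy_lower <- tolower(toy_marking)
# split on anything that is not a lowercase letter, apostrophe or hyphen
toy_split <- unlist(strsplit(toy_lower, "[^a-z'-]"))
# drop the empty strings left behind by spaces and punctuation
toy_words <- toy_split[toy_split != ""]

toy_raw <- table(toy_words)                # raw counts per word type
toy_rel <- 100 * (toy_raw / sum(toy_raw))  # relative frequencies as percentages

toy_raw
round(toy_rel, 1)
```

In the loop above, toy_raw corresponds to what is stored in div.raws.l and toy_rel to what is stored in div.freqs.l for each div.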

We might want to then determine the average word frequency, which is one way of assessing textual variance.

# mean word frequency
sum(div.raws.l[[1]])/length(div.raws.l[[1]])
mean(div.raws.l[[1]]) # meaning: each word type in the first marking is used an average of 1.058 times

Now we can plot the calculation:

# to extract the frequency data from all of the chapters at once
lapply(div.raws.l,mean)
# putting results into a matrix object
mean.word.use.m <- do.call(rbind, lapply(div.raws.l,mean))
dim(mean.word.use.m)
# this reports 703 rows in 1 column, but there's more info in the matrix
plot(mean.word.use.m, type = "h", main = "Mean word usage patterns in Melville's markings in Shakespeare",
     ylab = "mean word use", xlab = "nodes (each marking): 
     1-288, comedies; 289-352, histories; 353-374, other; 375-653, tragedies")

We can use a different calculation to assess the type-token ratio, which is a different standard of textual variety:

# calculate TTR for first node
length(div.raws.l[[1]])/sum(div.raws.l[[1]])*100
# now use lapply to run across all nodes
ttr.l <- lapply(div.raws.l, function(x) {length(x)/sum(x)*100})
ttr.m <- do.call(rbind, ttr.l)
ttr.m[order(ttr.m, decreasing = TRUE),]
plot(ttr.m, type = "h", main = "Type-token ratios in Melville's markings in Shakespeare",
     ylab = "lexical variety", xlab = "nodes (each marking <div>)")
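
The two measures used so far are closely related, which a small worked example makes visible. This sketch uses an invented table of counts (toy_raw, with ten tokens and eight word types) rather than one of Melville's markings.

```r
# Invented counts for one marking: ten tokens spread over eight word types
toy_raw <- table(c("to", "be", "or", "not", "to", "be", "that", "is", "the", "question"))

tokens <- sum(toy_raw)     # total number of words
types  <- length(toy_raw)  # number of distinct words

mean_use <- tokens / types        # mean word frequency, same value as mean(toy_raw)
ttr      <- types / tokens * 100  # type-token ratio as a percentage

mean_use
ttr
```

Here the mean word use is 10 / 8 = 1.25 and the type-token ratio is 8 / 10 * 100 = 80. Written this way, the type-token ratio is 100 divided by the mean word frequency, so the two plots show the same information from opposite directions: a high mean means more repetition, and a high ratio means more variety.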

Working with TEI

The code above also applies to TEI XML documents, but in R you need to specify the XML namespace in your getNodeSet XPath function. Let’s try this on our Bad Hamlet file.

ham.doc <- xmlTreeParse("bad-hamlet.xml", useInternalNodes=TRUE)
# select all of Hamlet's speeches in the XML file; each element in the XPath expression
# is prefaced by a 'tei' declaration. Also note the extra argument within getNodeSet
# that links to the TEI namespace.
ham.divs.ns.l <- getNodeSet(ham.doc, "/tei:TEI//tei:sp[@who='Hamlet']", namespaces = c(tei = "http://www.tei-c.org/ns/1.0"))
ham.divs.ns.l
div.freqs.l <- list() # create two list variables into which you will dump the XML data
div.raws.l <- list()

# the for loop grabs all of Hamlet's speeches and calculates the relative frequencies among words 
for(i in 1:length(ham.divs.ns.l)){
  div.content <- xmlValue(ham.divs.ns.l[[i]], "sp")[[1]]
  words.ns <- xmlElementsByTagName(ham.divs.ns.l[[i]], "l", recursive = TRUE)
  div.words.v <- paste(sapply(words.ns, xmlValue), collapse=" ")
  words.lower.v <- tolower(div.words.v)
  words.l <- strsplit(words.lower.v, "[^a-z'-]")
  word.v <- unlist(words.l)
  word.v <- word.v[which(word.v!="")]
  div.freqs.t <- table(word.v)
  div.raws.l[[div.content]] <- div.freqs.t
  div.freqs.l[[div.content]] <- 100*(div.freqs.t/sum(div.freqs.t))
}
div.freqs.un.v <- unlist(div.freqs.l)
sorted.div.freqs.v <- sort(div.freqs.un.v, decreasing=T)

# just for your records, this creates a spreadsheet of all your results
write.table(sorted.div.freqs.v, "hamlet.freqs.v.sorted.txt")

# show how many speeches are in the set
length(div.raws.l)
# mean word frequency
sum(div.raws.l[[1]])/length(div.raws.l[[1]])
mean(div.raws.l[[1]]) # meaning: each word type in the first speech is used an average of X times

# calculate TTR for first node
length(div.raws.l[[1]])/sum(div.raws.l[[1]])*100
# now use lapply to run across all nodes
ttr.l <- lapply(div.raws.l, function(x) {length(x)/sum(x)*100})
ttr.m <- do.call(rbind, ttr.l)
ttr.m[order(ttr.m, decreasing = TRUE),]
plot(ttr.m, type = "h", main = "Type-token ratios in Hamlet's speeches in the Bad Quarto",
     ylab = "lexical variety", xlab = "nodes (each <sp> by Hamlet)")
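
The namespace requirement described above can be checked on a small inline fragment before working with a full edition. This is a sketch only: toy_tei is an invented three-speech document, not the Bad Hamlet file, and tei_ns is a hypothetical variable name. It assumes the XML package is installed.

```r
library(XML)

# A tiny invented TEI fragment; the xmlns attribute puts every element in the TEI namespace
toy_tei <- '<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <sp who="Hamlet"><l>To be, or not to be</l></sp>
    <sp who="Horatio"><l>My lord</l></sp>
    <sp who="Hamlet"><l>The rest is silence</l></sp>
  </body></text>
</TEI>'

toy_doc <- xmlParse(toy_tei, asText = TRUE)
tei_ns  <- c(tei = "http://www.tei-c.org/ns/1.0")

# Every element name in the XPath carries the tei: prefix declared in tei_ns
hamlet_sp <- getNodeSet(toy_doc, "/tei:TEI//tei:sp[@who='Hamlet']", namespaces = tei_ns)

length(hamlet_sp)
sapply(hamlet_sp, xmlValue)
```

The prefix "tei" is arbitrary; what matters is that the URI it maps to matches the xmlns value declared in the document.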

Exercise 5:

Let’s try this on a different file. Load a TEI XML file of your choice and modify the code above to generate results based on your research questions.

setwd("~/Desktop")
# "*.xml" is a placeholder: replace it with the name of your own TEI file
doc <- xmlTreeParse("*.xml", useInternalNodes=TRUE)
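
If you are unsure of the exact file name, one option is to list the XML files in the working directory first. This is a hedged sketch rather than part of the exercise: xml_files is a hypothetical variable name, and picking the first file found is an arbitrary choice for illustration.

```r
# List the XML files in the working directory so that one can be chosen by name
xml_files <- list.files(pattern = "\\.xml$", ignore.case = TRUE)
xml_files

# Parse the first file found, if there is one
if (length(xml_files) > 0) {
  library(XML)
  doc <- xmlTreeParse(xml_files[1], useInternalNodes = TRUE)
}
```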

Publish your results

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).

If you are interested in learning more about R and corpus linguistics, in addition to Silge and Robinson and Jockers, you could also consult R. H. Baayen’s Analyzing Linguistic Data: A Practical Introduction to Statistics Using R (Cambridge UP, 2008) and Stefan Gries’s Quantitative Corpus Linguistics with R, 2nd ed. (Routledge, 2017).

Some good web resources include Jeff Rydberg-Cox’s Introduction to R and Julia Silge and David Robinson’s Text Mining with R. Also be sure to examine the CRAN R Documentation site.

