Installing R and RStudio

Before the session, make sure to download the R software package from http://www.r-project.org/.

Then download the latest version of RStudio at https://www.rstudio.com.

Part I: A Brief Intro to Programming and R

What exactly is programming?

Every computer program is a series of instructions—a sequence of separate, small commands. The art of programming is to take a general idea and break it apart into separate steps. (This may be just as important as learning the rules and syntax of a particular language.)

Programming (or code) consists of either imperative or declarative style. R uses imperative style, meaning it strings together instructions to the computer. (Declarative style involves telling the computer what the end result should be, like HTML code.) There are many subdivisions of imperative style, but the primary concern for beginning programmers should be procedural style: that is, describing the steps for achieving a task.

Each step/instruction is a statement—words, numbers, or equations that express a thought.

Why are there so many languages?

The central processing unit (CPU) of the computer does not understand any of them! The CPU only takes in machine code, which runs directly on the computer’s hardware. Machine code is basically unreadable, though: it’s a series of tiny numerical operations.

Programming languages bridge the gap between machine code/computer hardware and the human programmer: whatever we write must be translated into machine code before the CPU can run it. Several popular languages, R among them, are interpreted (as opposed to compiled) languages, meaning that an interpreter translates and runs our statements as it goes rather than building a separate executable beforehand. What we call our source code is our set of statements in our preferred language, which is then translated into machine code.

Source code is simply written in plain text in a text editor. Do not use a word processor.

The file extension tells the computer (and your editor) which language a source file is written in. For us, that means the “.R” extension (and the R notebook is “.Rmd”).

While you do not need a special program to write code, it is usually a good idea to use an IDE (integrated development environment) to help you. Many people (like me) use the oXygen IDE for editing XML documents and creating transformations with XSLT. Python users often use Spyder, PyCharm, or Anaconda Jupyter Notebooks.

For R, use RStudio (more on that in a moment).

Why are we using R?

Short answer: because I like R. I have learned some Python and JavaScript, too, but for some reason R worked better for me. This suggests an important takeaway from this session: there is no single language that is better than any other. What you choose to work with will depend on what materials you are working on, what level of comfort you have with a given language, and what kinds of outputs you would like from your code.

For example, if I am primarily interested in text-based edition projects, I would be wise to work mostly with XML technologies: TEI-XML, XPath, XSLT, XQuery, XProc, just to name a few. However, I have seen people use Python and JavaScript to transform XML. While I would advocate XSLT for such an operation, it is better for you to use your preferred language to get things done.

XML, R, Python, and JavaScript are all open-source languages. R and Python are quite similar and both can basically perform the same tasks.

That said, R does have some distinct advantages:

  • The visualisation libraries are excellent. With RStudio, reporting your results is almost instantaneous.

  • R Markdown makes it easy to integrate the code and the results in a web browser.

  • It is a functional language (meaning almost everything is accomplished through functions), which works well for some.

  • It was built by statisticians, so it is well suited to statistical analyses of structured text and data sets. (Python is probably better for more general-purpose and natural language processing tasks.)

  • It tends to be used by academics and researchers, so it works well with research questions (Python, on the other hand, is used more in private development.)

  • Strong user community: In CRAN (R’s open-source repository), around 12000 packages are currently available.

Some notable disadvantages:

  • It is slow with large data sets (in which case you might want a server instantiation of RStudio).

  • Its programming paradigm is somewhat non-standard. (It is based on the S language and its commercial implementation, S-PLUS.)

The R Environment (for those who are new to R)

When you first launch R, you will see a console:

[Image: the R console]

This interface allows you to run R commands just as you would run commands in a Bash shell.

When you open this file in RStudio, the command line interface (labeled “Console”) is below the editing window. When you run a code block in the editing window, you will see the results appear in the Console below.

About R Markdown

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code. The results can also be published as an HTML file.

A quick example: let’s do some math. Say I am making a travel budget, and I want to add the cost of hotel and flight prices for a trip to Seattle. The flight is £550 and the hotel price per night is £133. R can do the work for you.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter. (On a Windows machine you would press Ctrl+Shift+Enter.)

550 + 133

R can make all kinds of calculations, so if you want to get the total cost of a five-day trip to Seattle, you can add an operator for multiplication.

550 + 133 * 5
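The calculation above works because R follows the usual order of operations: multiplication is evaluated before addition, so only the nightly hotel rate is multiplied by five. A minimal sketch of the difference that parentheses make (the variable names are hypothetical, chosen only for this illustration):

```r
# Multiplication binds tighter than addition: flight + (hotel rate * nights)
flight.plus.nights <- 550 + 133 * 5

# Parentheses force the addition first, which would wrongly multiply the flight too
everything.times.five <- (550 + 133) * 5

flight.plus.nights
everything.times.five
```

The first value is the one we want for the trip budget; the second shows how easy it is to change the meaning of an expression with a pair of parentheses.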

R is an object-oriented programming language, so practically everything can be stored as an R object. The commands are either expressions or assignments. An expression (when written as a command) is evaluated and printed (unless specifically made invisible), but the value is lost. An assignment also evaluates an expression, but the value is stored in a variable (and the result is not automatically printed).

To make our calculations effective, we need to store these kinds of calculations in variables. Variables can be assigned with either <- or =. Let’s do that by comparing the price of a 5-day trip to Seattle to a 7-day trip to Paris.

sea.trip.v <- 550 + 133 * 5

paris.trip.v <- 110 + 90 * 7

sea.trip.v < paris.trip.v

Which trip is the more expensive one?

Guess we should go to Paris. What if I just want to do both?

sea.trip.v + paris.trip.v

Suppose further that I wanted to add in an optional 3-day trip to New York City. I want to see which trip would be more expensive if I were to take two out of the three options.

nyc.trip.v <- 335 + 175 * 3

sea.and.nyc <- sea.trip.v + nyc.trip.v 

sea.and.paris <- sea.trip.v + paris.trip.v

paris.and.nyc <- paris.trip.v + nyc.trip.v

Above you can see how powerful even simple R programming can be: you can store mathematical operations in named variables and use those variables to work with other variables (this becomes very important in word frequency calculations). You can also plot the results for quick assessment.

trips <- c(sea.and.nyc, sea.and.paris, paris.and.nyc)
barplot(trips, ylab = "Cost of each trip", names.arg = c("Seattle and NYC", "Seattle and Paris", "Paris and NYC"))

You see how this works, and how quickly one can store variables for even practical questions.

Reading Data in R

There are other important kinds of R data formats that you should know. The first is a vector, which is a single variable consisting of an ordered collection of values of the same type, such as numbers or words. An easy way to create a vector is to use the c function, which basically means “combine.”

v1 <- c("i", "wait", "with", "bated", "breath")

# confirm the value of the variable by running v1
v1

# identify a specific value by indicating its position in brackets
v1[4]

Get used to the functions that help you understand R: ? and example().

?c

example(c, echo = FALSE) # change the echo value to TRUE to get the results

The c function is widely used, but it is really only useful for creating small data sets. Many of you will probably want to load text files.

Jeff Rydberg-Cox provides some helpful tips for preparing data for R processing:

  • Download the text(s) from a source repository.

  • Remove extraneous material from the text(s).

  • Transform the text(s) to answer your research questions.

The best way to load text files is with the scan function. (The other important function is read.table, which, along with its relative read.csv, handles tabular files such as csv.) First, download a text file of Dickens’s Great Expectations into your working directory.

dickens.v <- scan("great-expectations.txt", what="character", sep="\n", encoding = "UTF-8")
Read 3913 items

You have now loaded Great Expectations into a variable called dickens.v.

With the text loaded, you can now run quick statistical operations, such as the number of lines and word frequencies.
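One way to arrive at a word vector like dickens.words.v is sketched below. It assumes the same steps that the Jockers loop later in this notebook uses (collapse the lines, lowercase them, split on "\\W"), and it runs on a two-line sample rather than the full novel. The sample.* names are hypothetical.

```r
# Stand-in for the lines returned by scan(); the real dickens.v holds the whole novel
sample.lines.v <- c("Chapter I",
                    "My father's family name being Pirrip, and my Christian name Philip,")

# Collapse the lines into one long string and lowercase it
sample.text.v <- tolower(paste(sample.lines.v, collapse = " "))

# Split on every non-word character; strsplit() returns a list, so unlist() flattens it
sample.words.v <- unlist(strsplit(sample.text.v, "\\W"))

sample.words.v
```

Because a comma followed by a space counts as two separate delimiters, the split leaves an empty string between "pirrip" and "and".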

dickens.words.v[1:20] # find the first 20 words in Great Expectations
 [1] "chapter"   "i"         "my"        "father"    "s"        
 [6] "family"    "name"      "being"     "pirrip"    ""         
[11] "and"       "my"        "christian" "name"      "philip"   
[16] ""          "my"        "infant"    "tongue"    "could"    

Did you notice the “\W” in the strsplit argument? What is that again? Regex! Notice that in R you need to use another backslash to indicate a character escape.

Also, did you notice the blank result on the 10th word? This requires a little clean-up step.

Extra white spaces often cause problems for text analysis.
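A minimal sketch of such a clean-up step, assuming the blanks are empty strings left over from splitting. The words.v and not.blanks.v names are hypothetical, and the five-word vector stands in for the full word list.

```r
# A short stand-in for the tokenized novel, with one blank left by the split
words.v <- c("being", "pirrip", "", "and", "my")

# which() returns the positions of the elements that are not empty strings
not.blanks.v <- which(words.v != "")

# Keep only those positions
words.clean.v <- words.v[not.blanks.v]
words.clean.v
```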

dickens.words.v[1:20]
 [1] "chapter"   "i"         "my"        "father"    "s"        
 [6] "family"    "name"      "being"     "pirrip"    "and"      
[11] "my"        "christian" "name"      "philip"    "my"       
[16] "infant"    "tongue"    "could"     "make"      "of"       

Voila! We might want to examine how many times the fourth result, “father,” occurs (a word that will probably be important in this book).

length(dickens.words.v[which(dickens.words.v=="father")])
[1] 69

Or produce a list of all unique words.

unique(sort(dickens.words.v, decreasing = FALSE))[1:50]
 [1] "0037m"      "0072m"      "0082m"      "0132m"      "0189m"     
 [6] "0223m"      "0242m"      "0245m"      "0279m"      "0295m"     
[11] "0335m"      "0348m"      "0393m"      "0399m"      "1"         
[16] "2"          "a"          "aback"      "abandoned"  "abased"    
[21] "abashed"    "abbey"      "abear"      "abel"       "aberdeen"  
[26] "aberration" "abet"       "abeyance"   "abhorrence" "abhorrent" 
[31] "abhorring"  "abide"      "abided"     "abilities"  "ability"   
[36] "abject"     "able"       "ablutions"  "aboard"     "abode"     
[41] "abominate"  "about"      "above"      "abraham"    "abreast"   
[46] "abroad"     "abrupt"     "abruptness" "absence"    "absent"    

Here we find another problem: we find in our unique word list some odd non-words such as “0037m.” We should strip those out.

Exercise 1

Create a regular expression to remove those non-words in dickens.words.v. Remember that you use two backslashes (\\) for character escape. For more information on using regex in R, RStudio has a helpful cheat sheet.
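One possible approach, offered as a sketch rather than the definitive answer: replace every token that consists only of digits (optionally followed by an "m") with an empty string, then drop the blanks. The pattern and the words.* names are assumptions made for this illustration, and the sample vector stands in for dickens.words.v.

```r
# A short stand-in for the word list, including two of the odd non-words
words.v <- c("0037m", "a", "aback", "1", "abbey")

# ^ and $ anchor the pattern to the whole token; \\d+ is one or more digits; m? is an optional "m"
words.stripped.v <- gsub("^\\d+m?$", "", words.v)

# The replaced tokens are now blanks, so filter them out
words.clean.v <- words.stripped.v[words.stripped.v != ""]
words.clean.v
```

Anchoring the pattern keeps it from trimming digits out of the middle of legitimate tokens; whether that is the right behavior depends on your text.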

Now let’s re-run that not.blanks vector to strip out the blank you just added.

unique(sort(dickens.words.clean.v, decreasing = FALSE))[1:50]
 [1] "a"           "aback"       "abandoned"   "abased"     
 [5] "abashed"     "abbey"       "abear"       "abel"       
 [9] "aberdeen"    "aberration"  "abet"        "abeyance"   
[13] "abhorrence"  "abhorrent"   "abhorring"   "abide"      
[17] "abided"      "abilities"   "ability"     "abject"     
[21] "able"        "ablutions"   "aboard"      "abode"      
[25] "abominate"   "about"       "above"       "abraham"    
[29] "abreast"     "abroad"      "abrupt"      "abruptness" 
[33] "absence"     "absent"      "absolute"    "absolutely" 
[37] "absolve"     "absorbed"    "abstinence"  "abstraction"
[41] "abstracts"   "absurd"      "absurdest"   "absurdly"   
[45] "abundance"   "abundantly"  "abyss"       "accept"     
[49] "acceptable"  "acceptance" 

Returning to basic functions, now that we have done some more clean-up: how many unique words are in the book?

length(unique(dickens.words.clean.v))
[1] 10744

Divide this by the number of words in the whole book to calculate the vocabulary density ratio.
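A minimal sketch of how the two counts might be stored before dividing, using an eight-word sample in place of the cleaned novel. The words.v and density names are hypothetical; unique.words and total.words are the names used in the division itself.

```r
# A short stand-in for the cleaned word vector
words.v <- c("my", "father", "s", "family", "name", "my", "christian", "name")

# Number of distinct word types
unique.words <- length(unique(words.v))

# Number of word tokens overall
total.words <- length(words.v)

# Vocabulary density: types divided by tokens
density <- unique.words / total.words
density
```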

unique.words/total.words 
[1] 0.05688569

That’s actually a fairly small density number, 5.7% (Moby-Dick by comparison is about 8%).

The other important data structures are tables and data frames. These are probably the most useful for sophisticated analyses, because they render the data in a table that is very similar to a spreadsheet. A common workflow is to input your data in an Excel or Google Sheets spreadsheet and then export that data into a comma separated value (.csv) or tab separated value (.tsv) file. Many of the tidytext operations work with data frames, as we’ll see later.
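As a small illustration of that workflow, the sketch below writes a tiny csv file to a temporary location and reads it back as a data frame with read.csv. The file contents, the column names, and the trips.df name are all hypothetical; in practice the csv would be your exported spreadsheet.

```r
# Write a three-line csv to a temporary file so the example is self-contained
csv.path <- tempfile(fileext = ".csv")
writeLines(c("city,nights,cost",
             "Seattle,5,1215",
             "Paris,7,740"), csv.path)

# read.csv() turns the file into a data frame: one row per trip, one column per variable
trips.df <- read.csv(csv.path, stringsAsFactors = FALSE)
str(trips.df)

# Columns can be pulled out with $ and filtered like any other vector
paris.cost <- trips.df$cost[trips.df$city == "Paris"]
paris.cost
```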

Flow control

Flow control covers repetitive operations and conditional logic, two of the more important reasons why we use programming languages. The most common tool for repetition is the for() loop, which has the following syntax:

for (name in vector) {[enter commands]}

This sets a variable called name equal to each of the elements of the vector in sequence. The commands inside the braces run once for each element.

A simple example is the Fibonacci sequence. A for() loop can automatically generate the first 20 Fibonacci numbers.
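A minimal sketch of such a loop, assuming the sequence starts with two 1s and is stored in a numeric vector named Fibonacci:

```r
# Pre-allocate a vector of 20 slots and seed the first two values
Fibonacci <- numeric(20)
Fibonacci[1] <- 1
Fibonacci[2] <- 1

# Each later value is the sum of the two values before it
for (i in 3:20) {
  Fibonacci[i] <- Fibonacci[i - 2] + Fibonacci[i - 1]
}
```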

Fibonacci
 [1]    1    1    2    3    5    8   13   21   34   55   89  144  233
[14]  377  610  987 1597 2584 4181 6765

There is another important component to flow control: the conditional. In programming this takes the form of if() statements.

Syntax

if (condition) {commands when TRUE}

if (condition) {commands when TRUE} else {commands when FALSE}

We will not have time to go into details regarding these operations, but it is important to recognize them when you are reading or modifying someone else’s code.
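To help with recognizing the pattern, here is a small sketch that combines a for() loop with an if()/else statement. It reuses the trip totals from the budget example earlier; the budget of 1000 and the trip.costs and verdicts names are hypothetical.

```r
# Trip totals from the earlier budget example, stored as a named vector
trip.costs <- c(Seattle = 1215, Paris = 740, NYC = 860)
budget <- 1000

# An empty character vector to collect one verdict per city
verdicts <- character(0)

for (city in names(trip.costs)) {
  if (trip.costs[[city]] <= budget) {
    verdicts[city] <- "within budget"
  } else {
    verdicts[city] <- "over budget"
  }
}

verdicts
```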

Now, using what we know about regular expressions and flow control, let’s have a look at a for() loop that Matthew Jockers uses in Chapter 4 of his Text Analysis with R for Students of Literature. It’s a fairly complicated but useful way of breaking up a novel text into chapters for comparative analysis. Let’s return to Dickens.
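The sketch below follows the shape of that loop on a six-line toy text instead of the full novel, so the numbers it produces will not match the Dickens results. It assumes that chapter headings look like "Chapter I" and that a final "END" line has been appended as a sentinel; both the toy text and the heading pattern are assumptions made for this illustration.

```r
# Toy stand-in for the novel: two chapters plus an END sentinel line
novel.lines.v <- c("Chapter I", "My father said no.", "I ran home.",
                   "Chapter II", "My sister was cross.", "END")

# Positions of the chapter headings, plus the position of the sentinel
chap.positions.v <- grep("^Chapter [IVXLC]+$", novel.lines.v)
chap.positions.v <- c(chap.positions.v, length(novel.lines.v))

chapter.freqs.l <- list()

for (i in seq_along(chap.positions.v)) {
  # The last position is the sentinel, not a chapter, so skip it
  if (i != length(chap.positions.v)) {
    chapter.title <- novel.lines.v[chap.positions.v[i]]
    start <- chap.positions.v[i] + 1
    end <- chap.positions.v[i + 1] - 1
    # Collapse the chapter's lines, lowercase them, and split into words
    chapter.words.v <- tolower(paste(novel.lines.v[start:end], collapse = " "))
    chapter.word.v <- unlist(strsplit(chapter.words.v, "\\W"))
    chapter.word.v <- chapter.word.v[which(chapter.word.v != "")]
    # Raw counts, then relative frequencies as percentages
    chapter.freqs.t <- table(chapter.word.v)
    chapter.freqs.l[[chapter.title]] <- 100 * (chapter.freqs.t / sum(chapter.freqs.t))
  }
}

names(chapter.freqs.l)
```

Each list element is a table of relative frequencies for one chapter, keyed by the chapter title.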

length(chapter.freqs.l)[1]
[1] 58

Suppose I wanted to get all relative frequencies of the word “father” in each chapter.
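One common idiom for this is to apply the bracket operator to every element of the list with lapply. The sketch below uses a hypothetical two-chapter list with made-up frequencies, not the Dickens data; a chapter that lacks the word comes back as NA.

```r
# Hypothetical per-chapter relative frequencies (values invented for illustration)
chapter.freqs.l <- list(
  `Chapter I`  = c(father = 0.27, my = 1.9),
  `Chapter II` = c(my = 2.1, sister = 0.4)
)

# "[" is itself a function, so lapply can call it on each chapter with "father" as the index
father.freqs <- lapply(chapter.freqs.l, "[", "father")
father.freqs
```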

father.freqs
$`Chapter I`
   father 
0.2696872 

$`Chapter II`
<NA> 
  NA 

$`Chapter III`
<NA> 
  NA 

$`Chapter IV`
<NA> 
  NA 

$`Chapter V`
<NA> 
  NA 

$`Chapter VI`
<NA> 
  NA 

$`Chapter VII`
   father 
0.1470588 

$`Chapter VIII`
<NA> 
  NA 

$`Chapter IX`
    father 
0.03696858 

$`Chapter X`
<NA> 
  NA 

$`Chapter XI`
<NA> 
  NA 

$`Chapter XII`
<NA> 
  NA 

$`Chapter XIII`
<NA> 
  NA 

$`Chapter XIV`
<NA> 
  NA 

$`Chapter XV`
<NA> 
  NA 

$`Chapter XVI`
<NA> 
  NA 

$`Chapter XVII`
<NA> 
  NA 

$`Chapter XVIII`
<NA> 
  NA 

$`Chapter XIX`
<NA> 
  NA 

$`Chapter XX`
   father 
0.0621118 

$`Chapter XXI`
   father 
0.1105583 

$`Chapter XXII`
   father 
0.3774335 

$`Chapter XXIII`
    father 
0.03092146 

$`Chapter XXIV`
<NA> 
  NA 

$`Chapter XXV`
<NA> 
  NA 

$`Chapter XXVI`
<NA> 
  NA 

$`Chapter XXVII`
    father 
0.06512537 

$`Chapter XXVIII`
<NA> 
  NA 

$`Chapter XXIX`
    father 
0.02010859 

$`Chapter XXX`
   father 
0.2646281 

$`Chapter XXXI`
<NA> 
  NA 

$`Chapter XXXII`
<NA> 
  NA 

$`Chapter XXXIII`
<NA> 
  NA 

$`Chapter XXXIV`
    father 
0.04230118 

$`Chapter XXXV`
<NA> 
  NA 

$`Chapter XXXVI`
<NA> 
  NA 

$`Chapter XXXVII`
    father 
0.03483107 

$`Chapter XXXVIII`
<NA> 
  NA 

$`Chapter XXXIX`
    father 
0.02002403 

$`Chapter XL`
<NA> 
  NA 

$`Chapter XLI`
<NA> 
  NA 

$`Chapter XLII`
<NA> 
  NA 

$`Chapter XLIII`
<NA> 
  NA 

$`Chapter XLIV`
<NA> 
  NA 

$`Chapter XLV`
<NA> 
  NA 

$`Chapter XLVI`
   father 
0.1315789 

$`Chapter XLVII`
<NA> 
  NA 

$`Chapter XLVIII`
<NA> 
  NA 

$`Chapter XLIX`
<NA> 
  NA 

$`Chapter L`
  father 
0.130719 

$`Chapter LI`
   father 
0.2979146 

$`Chapter LII`
<NA> 
  NA 

$`Chapter LIII`
    father 
0.01871958 

$`Chapter LIV`
<NA> 
  NA 

$`Chapter LV`
    father 
0.03460208 

$`Chapter LVI`
<NA> 
  NA 

$`Chapter LVII`
<NA> 
  NA 

$`Chapter LVIII`
    father 
0.03207184 

You could also use variations of the which function to identify the chapters with the highest and lowest frequencies.

which.max(father.freqs)
Chapter XXII 
          22 

Exercise 2

Create a vector that confines your results to only the paragraphs with dialogue.

dialogue.v <- grep('("([^"]|"")*")', novel.lines.v) # grep is another regex function

novel.lines.v[dialogue.v][1:20] # check your work by finding all the dialogue lines in novel.lines.v

Bonus Exercise

Modify the for loop in Jockers to find word frequencies only of content with dialogue.

dialogue.chapter.raws.l <- list()
dialogue.chapter.freqs.l <- list()

for(i in 1:length(chap.positions.v)){
  if(i != length(chap.positions.v)){
    chapter.title <- novel.lines.v[chap.positions.v[i]]
    start <- chap.positions.v[i]+1
    end <- chap.positions.v[i+1]-1
    chapter.lines.v <- novel.lines.v[start:end]
    dialogue.lines.v <- grep('"(.*?)"', chapter.lines.v, value = TRUE) # here is the grep again, pruning the chapter.lines vector into lines with dialogue
    chapter.words.v <- tolower(paste(dialogue.lines.v, collapse=" "))
    chapter.words.l <- strsplit(chapter.words.v, "\\W")
    chapter.word.v <- unlist(chapter.words.l)
    chapter.word.v <- chapter.word.v[which(chapter.word.v!="")]
    chapter.freqs.t <- table(chapter.word.v)
    dialogue.chapter.raws.l[[chapter.title]] <- chapter.freqs.t
    chapter.freqs.t.rel <- 100*(chapter.freqs.t/sum(chapter.freqs.t))
    dialogue.chapter.freqs.l[[chapter.title]] <- chapter.freqs.t.rel
  }
}

dialogue.chapter.freqs.l[1]

Part II: Using TidyText to ‘read’ all of Livy

For these two lessons we will be modifying code from Julia Silge and David Robinson’s Text Mining with R: A Tidy Approach.

Before getting started, make sure you have set your working directory.

setwd("~/Desktop")

We did this to situate ourselves correctly within the filing system: we set our working directory to a reasonable place, the Desktop.

Note that the tilde (~) stands for your home directory, and your Desktop should be the next step (/) down from there. In Windows you would need to type out the file path, so something like C:/Users/[username]/Desktop (R expects forward slashes, or doubled backslashes, in file paths).

Next we load the necessary libraries for these lessons. Note: If you get error messages, you will need to install the libraries by navigating to the “Packages” tab on the right-side panel of RStudio. Then click “Install,” enter the name of the package, and install it.
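If you prefer to check from the console, the sketch below reports which of the packages used in these lessons are not yet installed. The needed, is.installed, and missing.pkgs names are hypothetical, and the install line is left commented out so that nothing is downloaded unless you choose to run it.

```r
# The packages loaded in these lessons
needed <- c("tidytext", "dplyr", "stringr", "glue",
            "tidyverse", "tidyr", "ggplot2", "gutenbergr")

# requireNamespace() returns TRUE if a package is installed and FALSE otherwise
is.installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing.pkgs <- needed[!is.installed]

if (length(missing.pkgs) > 0) {
  message("Not yet installed: ", paste(missing.pkgs, collapse = ", "))
  # install.packages(missing.pkgs)  # uncomment to install from CRAN
}
```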

library(tidytext)
library(dplyr)
library(stringr)
library(glue)
library(tidyverse)
library(tidyr)
library(ggplot2)
library(gutenbergr)

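Before going into more details, I will briefly explain the ‘tidy’ approach to data that will be used in the following. The tidy approach assumes three principles regarding data structure:[1]

  • Each variable is a column.

  • Each observation is a row.

  • Each type of observational unit is a table.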

What results is a table with one-token-per-row. (Recall that a token is any meaningful unit of text: usually it is a word, but it can also be an n-gram, sentence, or even a root of a word.)

pound_poem <- c("The apparition of these faces in the crowd;", "Petals on a wet, black bough.")

pound_poem

Here we have created a character vector like we did before: the vector consists of two strings of text. In order to transform this into tidy format, we need to transform it into a data frame (here called a ‘tibble’, a type of data frame in R that is more convenient for text-based analysis).

pound_poem_df <- tibble(line = 1:2, text = pound_poem)

pound_poem_df

While better, this format is still not useful for tidy text analysis because we still need each word to be individually accounted for. To accomplish this act of tokenization, use the unnest_tokens function.

pound_poem_df %>% unnest_tokens(word, text)
# the unnest_tokens function requires two arguments: the output column name (word), and the input column that the text comes from (text)

Notice how each word is in its own row, but also that its original line number is still intact. That is the basic logic of tidy text analysis. Now let’s apply this to a larger data set.

Using the gutenbergr package with tidytext:

By printing the gutenberg_authors data set, you can see the format of the names.

gutenberg_authors

Let’s run our first file loading function.

# this searches gutenberg for titles with the author name specified after the 'str_detect' function
gutenberg_works(str_detect(author, "Livy"))$title

Did you notice anything wrong with this? The first result duplicates some of the content of the fourth, so we should not use that first text id. Remember, the first rule of scholarship is TRUST NO ONE. In computing, never trust your data. So we’ll narrow the ingestion of the gutenberg ids to start with the second result.

# creates a variable that holds the gutenberg ids of the second through fifth Livy results
ids <- gutenberg_works(str_detect(author, "Livy"))$gutenberg_id[2:5]

livy <- gutenbergr::gutenberg_download(ids)
livy <- livy %>%
  group_by(gutenberg_id) %>%
  mutate(line = row_number()) %>%
  ungroup()

Here we invoked the ‘gutenberg_works’ function to find Livy and created a new data frame called livy. What exactly does the gutenberg_download function do? Again, type in the ? before the function to receive a description from the R Documentation. Try the example function, too.

Also, from the code above you might be wondering what the $ and %>% symbols mean. The $ extracts a named column from a data frame (or a named element from a list), so gutenberg_works(...)$title returns just the title column. The %>% is a connector (a pipe) that mimics nesting. The rule is that the object on the left side is passed as the first argument to the function on the right-hand side, so livy %>% group_by(gutenberg_id) is the same as group_by(livy, gutenberg_id), and the full chain above is the same as ungroup(mutate(group_by(livy, gutenberg_id), line = row_number())). It just makes the code (and particularly multi-step functions) more readable.[2]
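A quick way to convince yourself of the pipe rule is to write the same calculation both ways and compare. This sketch assumes the magrittr package is installed (it comes along with dplyr); the nested, piped, and trips.df names are hypothetical.

```r
library(magrittr)

# Nested form: read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# Piped form: read from left to right; each result becomes the first argument of the next call
piped <- c(1, 4, 9) %>% sqrt() %>% sum()

identical(nested, piped)

# The $ operator pulls one named column out of a data frame
trips.df <- data.frame(city = c("Seattle", "Paris"), cost = c(1215, 740))
trips.df$cost
```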

?gutenberg_download

Now let’s see what we have downloaded. R has a summary function to show a summary of each column in the new data frame we just created, livy.

summary(livy)

Now we transform this into a tidy data set.

tidy_livy <- livy %>%
  unnest_tokens(output = word, input = text, token = "words")
  
tidy_livy %>% 
  count(word, sort = TRUE) %>%
  filter(n > 4000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Now we are mostly seeing function words in these results. But what is interesting about the function words? Notice the prominence of pronouns, for example.

Of course you will want to complement these results with substantive results (i.e., with stop words filtered out).

data(stop_words)

tidy_livy <- tidy_livy %>%
  anti_join(stop_words)

livy_plot <- tidy_livy %>% 
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  ylab("Word frequencies in Livy's History of Rome") +
  coord_flip()

livy_plot

In the visual above, you might want to locate the button in the upper right corner ‘Show in New Window’, so that you can zoom the results out.

We might also want to read the word frequencies in a table (or have a searchable list of them). The first code block below renders the results above in a table, and the second code block writes all of the results into a csv (spreadsheet) file.

tidy_livy %>%
  count(word, sort = TRUE)

livy_words <- tidy_livy %>%
  count(word, sort = TRUE)

write_csv(livy_words, "livy_words.csv")

# Note that if you want to retain the tidy data (that is, the title-line-word columns in multiple works, say),
# then you would just invoke the tidy_livy variable: write_csv(tidy_livy, "livy_words.csv")

Much of what we have done can also be done in Voyant Tools, to be sure. However, we have been able to load data faster in R, and we have also organized the data in tidytext tables that allow us to make judgments about the similarities and differences between the works in the corpus. It is also important to stress that you retain more control over organizing and manipulating your data with R, whereas in Voyant you are beholden to unstructured text files in a pre-built visualization interface.

To illustrate this flexibility, let’s investigate the data in ways that are unique to R (and programming in general).

We might want to make similar calculations by book, which is easier now due to the tidy data structure.

livy_word_freqs_by_book <- tidy_livy %>%
  group_by(gutenberg_id) %>%
  count(word, sort = TRUE) %>%
  ungroup()

livy_word_freqs_by_book %>%
    filter(n > 250) %>%
    ggplot(mapping = aes(x = word, y = n)) +
    geom_col() +
    coord_flip()

This shows you the general trend of each word that is used more than 250 times in alphabetical order. We can also break up the results into individual graphs for each book.

livy_word_freqs_by_book %>%
    filter(n > 250) %>%
    ggplot(mapping = aes(x = word, y = n)) +
    geom_col() +
    coord_flip() + facet_wrap(facets = ~ gutenberg_id)

This might appear to be an overwhelming picture, but it is an immediate display of similarities and differences between books. Granted, they are slightly out of order (id 10907 is The History of Rome, Books 09 to 26, and 12582 is Books 01 to 08), but you can immediately notice how the first half differs from the second in its content.

We could re-engineer the code in the previous examples to look more closely at these results. First we’ll narrow our data set to the more interesting id numbers mentioned already.

livy2 <- gutenberg_download(c(10907, 44318))

livy_tidy2 <- livy2 %>%
  group_by(gutenberg_id) %>%
  mutate(line = row_number()) %>%
  ungroup()

livy_tidy2 <- livy_tidy2 %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

livy_word_freqs_by_book <- livy_tidy2 %>%
  group_by(gutenberg_id) %>%
  count(word, sort = TRUE) %>%
  ungroup()

livy_word_freqs_by_book %>%
    filter(n > 210) %>%
    ggplot(mapping = aes(x = word, y = n)) +
    geom_col() +
    coord_flip() + facet_wrap(facets = ~ gutenberg_id)

What is the most consistent word used throughout Livy’s History?

Let’s now compare these results to another important chronicler, from a different era: Herodotus.

herodotus <- gutenberg_download(c(2707, 2456))

This downloads the two-volume e-text of Herodotus’ Histories. (The values inside c() are the gutenberg ids of the two volumes. The ids can be found by searching for texts on gutenberg.org, clicking on the Bibrec tab, and copying the EBook-No.)

tidy_herodotus <- herodotus %>%
  unnest_tokens(word, text)

tidy_herodotus %>%
  count(word, sort = TRUE)

What are the differences here with the Livy results?

Now let’s filter out the stop words again.

tidy_herodotus <- herodotus %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) 

tidy_herodotus %>%
  count(word, sort = TRUE)

We could also add into the mix yet another text. Let’s try Edward Gibbon.

gibbon <- gutenberg_works(author == "Gibbon, Edward") %>%
  gutenberg_download(meta_fields = "title")

tidy_gibbon <- gibbon %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

tidy_gibbon %>%
  count(word, sort = TRUE)

Let’s visualize the differences.

frequency <- bind_rows(mutate(tidy_livy, author = "Livy"),
                       mutate(tidy_herodotus, author = "Herodotus"),
                       mutate(tidy_gibbon, author = "Edward Gibbon")) %>% 
  mutate(word = str_extract(word, "['a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  spread(author, proportion) %>% 
  gather(author, proportion, `Livy`:`Herodotus`)
library(scales)
ggplot(frequency, aes(x = proportion, y = `Edward Gibbon`, color = abs(`Edward Gibbon` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Edward Gibbon", x = NULL)

Words that fall close to the diagonal line in these plots have similar frequencies in both sets of texts, and those near its upper end are frequent in both.

cor.test(data = frequency[frequency$author == "Livy",],
         ~ proportion + `Edward Gibbon`)
cor.test(data = frequency[frequency$author == "Herodotus",],
         ~ proportion + `Edward Gibbon`)

What this suggests (statistically) is that the word frequencies of Gibbon are more correlated with Herodotus than with Livy, which is fascinating, given that Gibbon was writing about the same subject as Livy!

What else can you infer from these comparisons?


  1. For more on this, see Hadley Wickham’s “Tidy Data,” Journal of Statistical Software 59 (2014): 1–23. https://doi.org/10.18637/jss.v059.i10.

  2. Granted, it is not part of R’s base code, but it was defined by the magrittr package and is now widely used in the dplyr and tidyr packages.

b3B0aW1hbCBmb3IgZG9pbmcgc3RhdGlzdGljYWwgYW5hbHlzZXMgd2l0aCBzdHJ1Y3R1cmVkIHRleHQgYW5kIGRhdGEgc2V0cy4gKFB5dGhvbiBpcyBwcm9iYWJseSBiZXR0ZXIgZm9yIG1vcmUgZ2VuZXJhbCBwdXJwb3NlIGFuZCBuYXR1cmFsIGxhbmd1YWdlIHByb2Nlc3NpbmcgdGFza3MuKQoKLSBJdCB0ZW5kcyB0byBiZSB1c2VkIGJ5IGFjYWRlbWljcyBhbmQgcmVzZWFyY2hlcnMsIHNvIGl0IHdvcmtzIHdlbGwgd2l0aCByZXNlYXJjaCBxdWVzdGlvbnMgKFB5dGhvbiwgb24gdGhlIG90aGVyIGhhbmQsIGlzIHVzZWQgbW9yZSBpbiBwcml2YXRlIGRldmVsb3BtZW50LikKCi0gU3Ryb25nIHVzZXIgY29tbXVuaXR5OiBJbiBDUkFOIChSJ3Mgb3Blbi1zb3VyY2UgcmVwb3NpdG9yeSksIGFyb3VuZCAxMjAwMCBwYWNrYWdlcyBhcmUgY3VycmVudGx5IGF2YWlsYWJsZS4KClNvbWUgbm90YWJsZSBkaXNhZHZhbnRhZ2VzOgoKLSBJdCBpcyBzbG93IHdpdGggbGFyZ2UgZGF0YSBzZXRzIChpbiB3aGljaCBjYXNlIHlvdSBtaWdodCB3YW50IGEgc2VydmVyIGluc3RhbnRpYXRpb24gb2YgUlN0dWRpbykuCgotIEl0cyBwcm9ncmFtbWluZyBwYXJhZGlnbSBpcyBzb21ld2hhdCBub24tc3RhbmRhcmQuIEl0IGlzIGJhc2VkIG9uIHRoZSBjb21tZXJjaWFsIFtTIGFuZCBTLXBsdXMgbGFuZ3VhZ2VzXShodHRwczovL2VuLndpa2lwZWRpYS5vcmcvd2lraS9TXyhwcm9ncmFtbWluZ19sYW5ndWFnZSkpKS4gCgojIyMgVGhlIFIgRW52aXJvbm1lbnQgKGZvciB0aG9zZSB3aG8gYXJlIG5ldyB0byBSKQoKV2hlbiB5b3UgZmlyc3QgbGF1bmNoIFIsIHlvdSB3aWxsIHNlZSBhIGNvbnNvbGU6CgohW1IgaW1hZ2VdKGh0dHBzOi8vZGFlZGFsdXMudW1rYy5lZHUvU3RhdGlzdGljYWxNZXRob2RzL2ltYWdlcy9SLUNvbnNvbGUtMzAweDI4MC5wbmcpCgpUaGlzIGludGVyZmFjZSBhbGxvd3MgeW91IHRvIHJ1biBSIGNvbW1hbmRzIGp1c3QgbGlrZSB5b3Ugd291bGQgcnVuIGNvbW1hbmRzIG9uIGEgQmFzaCBzY3JpcHRpbmcgc2hlbGwuIAoKV2hlbiB5b3Ugb3BlbiB0aGlzIGZpbGUgaW4gUlN0dWRpbywgdGhlIGNvbW1hbmQgbGluZSBpbnRlcmZhY2UgKGxhYmVsZWQgIkNvbnNvbGUiKSBpcyBiZWxvdyB0aGUgZWRpdGluZyB3aW5kb3cuIFdoZW4geW91IHJ1biBhIGNvZGUgYmxvY2sgaW4gdGhlIGVkaXRpbmcgd2luZG93LCB5b3Ugd2lsbCBzZWUgdGhlIHJlc3VsdHMgYXBwZWFyIGluIHRoZSBDb25zb2xlIGJlbG93LgoKIyMjIEFib3V0IFIgTWFya2Rvd24KClRoaXMgaXMgYW4gW1IgTWFya2Rvd25dKGh0dHA6Ly9ybWFya2Rvd24ucnN0dWRpby5jb20pIE5vdGVib29rLiBXaGVuIHlvdSBleGVjdXRlIGNvZGUgd2l0aGluIHRoZSBub3RlYm9vaywgdGhlIHJlc3VsdHMgYXBwZWFyIGJlbmVhdGggdGhlIGNvZGUuIFRoZSByZXN1bHRzIGNhbiBhbHNvIGJlIHB1Ymxpc2hlZCBhcyBhbiBIVE1MIGZpbGUuIAoKQSBxdWljayBleGFtcGxlOiBsZXQncyBkbyBzb21lIG1h
dGguIFNheSBJIGFtIG1ha2luZyBhIHRyYXZlbCBidWRnZXQsIGFuZCBJIHdhbnQgdG8gYWRkIHRoZSBjb3N0IG9mIGhvdGVsIGFuZCBmbGlnaHQgcHJpY2VzIGZvciBhIHRyaXAgdG8gU2VhdHRsZS4gVGhlIGZsaWdodCBpcyDCozU1MCBhbmQgdGhlIGhvdGVsIHByaWNlIHBlciBuaWdodCBpcyDCozEzMy4gUiBjYW4gZG8gdGhlIHdvcmsgZm9yIHlvdS4KClRyeSBleGVjdXRpbmcgdGhpcyBjaHVuayBieSBjbGlja2luZyB0aGUgKlJ1biogYnV0dG9uIHdpdGhpbiB0aGUgY2h1bmsgb3IgYnkgcGxhY2luZyB5b3VyIGN1cnNvciBpbnNpZGUgaXQgYW5kIHByZXNzaW5nICpDbWQrU2hpZnQrRW50ZXIqLiAoT24gYSBXaW5kb3dzIG1hY2hpbmUgeW91IHdvdWxkIHByZXNzICpbV2luZG93cyBidXR0b25dK1NoaWZ0K0VudGVyKi4pCgpgYGB7cn0KNTUwICsgMTMzCmBgYAoKUiBjYW4gbWFrZSBhbGwga2luZHMgb2YgY2FsY3VsYXRpb25zLCBzbyBpZiB5b3Ugd2FudCB0byBnZXQgdGhlIHRvdGFsIGNvc3Qgb2YgYSBmaXZlLWRheSB0cmlwIHRvIFNlYXR0bGUsIHlvdSBjYW4gYWRkIGFuIG9wZXJhdG9yIGZvciBtdWx0aXBsaWNhdGlvbi4KCmBgYHtyfQo1NTAgKyAxMzMgKiA1CmBgYAoKUiBpcyBhbiBvYmplY3Qtb3JpZW50ZWQgcHJvZ3JhbW1pbmcgbGFuZ3VhZ2UsIHNvIHByYWN0aWNhbGx5IGV2ZXJ5dGhpbmcgY2FuIGJlIHN0b3JlZCBhcyBhbiBSIG9iamVjdC4gVGhlIGNvbW1hbmRzIGFyZSBlaXRoZXIgKipleHByZXNzaW9ucyoqIG9yICoqYXNzaWdubWVudHMqKi4gQW4gZXhwcmVzc2lvbiAod2hlbiB3cml0dGVuIGFzIGEgY29tbWFuZCkgaXMgZXZhbHVhdGVkIGFuZCBwcmludGVkICh1bmxlc3Mgc3BlY2lmaWNhbGx5IG1hZGUgaW52aXNpYmxlKSwgYnV0IHRoZSB2YWx1ZSBpcyBsb3N0LiBBbiBhc3NpZ25tZW50IGFsc28gZXZhbHVhdGVzIGFuIGV4cHJlc3Npb24gYnV0IHRoZSB0aGUgdmFsdWUgaXMgc3RvcmVkIGluIGEgdmFyaWFibGUgKGFuZCB0aGUgcmVzdWx0IGlzIG5vdCBhdXRvbWF0aWNhbGx5IHByaW50ZWQpLgoKVG8gbWFrZSBvdXIgY2FsY3VsYXRpb25zIGVmZmVjdGl2ZSwgd2UgbmVlZCB0byBzdG9yZSB0aGVzZSBraW5kcyBvZiBjYWxjdWxhdGlvbnMgaW4gdmFyaWFibGVzLiBWYXJpYWJsZXMgY2FuIGJlIGFzc2lnbmVkIHdpdGggZWl0aGVyIGA8LWAgb3IgYD1gLiBMZXQncyBkbyB0aGF0IGJ5IGNvbXBhcmluZyB0aGUgcHJpY2Ugb2YgYSA1LWRheSB0cmlwIHRvIFNlYXR0bGUgdG8gYSA3LWRheSB0cmlwIHRvIFBhcmlzLgoKYGBge3J9CnNlYS50cmlwLnYgPC0gNTUwICsgMTMzICogNQoKcGFyaXMudHJpcC52IDwtIDExMCArIDkwICogNwoKc2VhLnRyaXAudiA8IHBhcmlzLnRyaXAudgpgYGAKCldoYXQgaXMgdGhlIG1vc3QgZXhwZW5zaXZlIHRyaXA/CgpHdWVzcyB3ZSBzaG91bGQgZ28gdG8gUGFyaXMuIFdoYXQgaWYgSSBqdXN0IHdhbnQgdG8gZG8gYm90aD8KCmBgYHtyfQpzZWEudHJpcC52ICsgcGFyaXMudHJpcC52CmBg
YAoKU3VwcG9zZSBmdXJ0aGVyIHRoYXQgSSB3YW50ZWQgdG8gYWRkIGluIGFuIG9wdGlvbmFsIDMtZGF5IHRyaXAgdG8gTmV3IFlvcmsgQ2l0eS4gSSB3YW50IHRvIHNlZSB3aGljaCB0cmlwIHdvdWxkIGJlIG1vcmUgZXhwZW5zaXZlIGlmIEkgd2VyZSB0byB0YWtlIHR3byBvdXQgb2YgdGhlIHRocmVlIG9wdGlvbnMuCgpgYGB7cn0KbnljLnRyaXAudiA8LSAzMzUgKyAxNzUgKiAzCgpzZWEuYW5kLm55YyA8LSBzZWEudHJpcC52ICsgbnljLnRyaXAudiAKCnNlYS5hbmQucGFyaXMgPC0gc2VhLnRyaXAudiArIHBhcmlzLnRyaXAudgoKcGFyaXMuYW5kLm55YyA8LSBwYXJpcy50cmlwLnYgKyBueWMudHJpcC52CgpgYGAKCkFib3ZlIHlvdSBjYW4gc2VlIGhvdyBwb3dlcmZ1bCBldmVuIHNpbXBsZSBSIHByb2dyYW1taW5nIGNhbiBiZTogeW91IGNhbiBzdG9yZSBtYXRoZW10aWNhbCBvcGVyYXRpb25zIGluIG5hbWVkIHZhcmlhYmxlcyBhbmQgdXNlIHRob3NlIHZhcmlhYmxlcyB0byB3b3JrIHdpdGggb3RoZXIgdmFyaWFibGVzICh0aGlzIGJlY29tZXMgdmVyeSBpbXBvcnRhbnQgaW4gd29yZCBmcmVxdWVuY3kgY2FsY3VsYXRpb25zKS4gWW91IGNhbiBhbHNvIHBsb3QgdGhlIHJlc3VsdHMgZm9yIHF1aWNrIGFzc2Vzc21lbnQuCgpgYGB7cn0KdHJpcHMgPC0gYyhzZWEuYW5kLm55Yywgc2VhLmFuZC5wYXJpcywgcGFyaXMuYW5kLm55YykKYmFycGxvdCh0cmlwcywgeWxhYiA9ICJDb3N0IG9mIGVhY2ggdHJpcCIsIG5hbWVzLmFyZyA9IGMoIlNlYXR0bGUgYW5kIE5ZQyIsICJTZWF0dGxlIGFuZCBQYXJpcyIsICJQYXJpcyBhbmQgTllDIikpCmBgYAoKWW91IHNlZSBob3cgdGhpcyB3b3JrcywgYW5kIGhvdyBxdWlja2x5IG9uZSBjYW4gc3RvcmUgdmFyaWFibGVzIGZvciBldmVuIHByYWN0aWNhbCBxdWVzdGlvbnMuCgojIyMgUmVhZGluZyBEYXRhIGluIFIKClRoZXJlIGFyZSBvdGhlciBpbXBvcnRhbnQga2luZHMgb2YgUiBkYXRhIGZvcm1hdHMgdGhhdCB5b3Ugc2hvdWxkIGtub3cuIFRoZSBmaXJzdCBpcyBhIHZlY3Rvciwgd2hpY2ggaXMgYSBzaW5nbGUgdmFyaWFibGUgY29uc2lzdGluZyBvZiBhbiBvcmRlcmVkIGNvbGxlY3Rpb24gbnVtYmVycyBhbmQvb3Igd29yZHMuIEFuIGVhc3kgd2F5IHRvIGNyZWF0ZSBhIHZlY3RvciBpcyB0byB1c2UgdGhlIGBjYCBjb21tYW5kLCB3aGljaCBiYXNpY2FsbHkgbWVhbnMgImNvbWJpbmUuIgoKYGBge3J9CnYxIDwtIGMoImkiLCAid2FpdCIsICJ3aXRoIiwgImJhdGVkIiwgImJyZWF0aCIpCgojIGNvbmZpcm0gdGhlIHZhbHVlIG9mIHRoZSB2YXJpYWJsZSBieSBydW5uaW5nIHYxCnYxCgojIGlkZW50aWZ5IGEgc3BlY2lmaWMgdmFsdWUgYnkgaW5kaWNhdGluZyBpdCBpcyBicmFja2V0cwp2MVs0XQpgYGAKCkdldCB1c2VkIHRvIHRoZSBmdW5jdGlvbnMgdGhhdCBoZWxwIHlvdSB1bmRlcnN0YW5kIFI6IGA/YCBhbmQgYGV4YW1wbGUoKWAuCgpgYGB7cn0KP2MKCmV4YW1wbGUoYywgZWNobyA9IEZBTFNF
KSAjIGNoYW5nZSB0aGUgZWNobyB2YWx1ZSB0byBUUlVFIHRvIGdldCB0aGUgcmVzdWx0cwpgYGAKClRoZSBgY2AgZnVuY3Rpb24gaXMgd2lkZWx5IHVzZWQsIGJ1dCBpdCBpcyByZWFsbHkgb25seSB1c2VmdWwgZm9yIGNyZWF0aW5nIHNtYWxsIGRhdGEgc2V0cy4gTWFueSBvZiB5b3Ugd2lsbCBwcm9iYWJseSB3YW50IHRvIGxvYWQgdGV4dCBmaWxlcy4KCltKZWZmIFJ5ZGJlcmctQ294XShodHRwczovL2RhZWRhbHVzLnVta2MuZWR1L1N0YXRpc3RpY2FsTWV0aG9kcy9wcmVwYXJpbmctbGl0ZXJhcnktZGF0YS5odG1sKSBwcm92aWRlcyBzb21lIGhlbHBmdWwgdGlwcyBmb3IgcHJlcGFyaW5nIGRhdGEgZm9yIFIgcHJvY2Vzc2luZzoKCi0gRG93bmxvYWQgdGhlIHRleHQocykgZnJvbSBhIHNvdXJjZSByZXBvc2l0b3J5LgoKLSBSZW1vdmUgZXh0cmFuZW91cyBtYXRlcmlhbCBmcm9tIHRoZSB0ZXh0KHMpLgoKLSBUcmFuc2Zvcm0gdGhlIHRleHQocykgdG8gYW5zd2VyIHlvdXIgcmVzZWFyY2ggcXVlc3Rpb25zLgoKVGhlIGJlc3Qgd2F5IHRvIGxvYWQgdGV4dCBmaWxlcyBpcyB3aXRoIHRoZSBgc2NhbmAgZnVuY3Rpb24uIChUaGUgb3RoZXIgaW1wb3J0YW50IGZ1bmN0aW9uIGlzIGByZWFkLnRhYmxlYCwgd2hpY2ggaGFuZGxlcyBjc3YgZmlsZXMuKSBGaXJzdCwgZG93bmxvYWQgYSB0ZXh0IGZpbGUgb2YgRGlja2VucydzIFsqR3JlYXQgRXhwZWN0YXRpb25zKl0oaHR0cHM6Ly93d3cuZHJvcGJveC5jb20vcy9xamk5dWViNDZhamFpdDkvZ3JlYXQtZXhwZWN0YXRpb25zLnR4dD9kbD0wKSBvbnRvIHlvdXIgd29ya2luZyBkaXJlY3RvcnkuCgpgYGB7cn0KZGlja2Vucy52IDwtIHNjYW4oImdyZWF0LWV4cGVjdGF0aW9ucy50eHQiLCB3aGF0PSJjaGFyYWN0ZXIiLCBzZXA9IlxuIiwgZW5jb2RpbmcgPSAiVVRGLTgiKQpgYGAKWW91IGhhdmUgbm93IGxvYWRlZCAqR3JlYXQgRXhwZWN0YXRpb25zKiBpbnRvIGEgdmFyaWFibGUgY2FsbGVkIGBkaWNrZW5zLnZgLgoKV2l0aCB0aGUgdGV4dCBsb2FkZWQsIHlvdSBjYW4gbm93IHJ1biBxdWljayBzdGF0aXN0aWNhbCBvcGVyYXRpb25zLCBzdWNoIGFzIHRoZSBudW1iZXIgb2YgbGluZXMgYW5kIHdvcmQgZnJlcXVlbmNpZXMuCgpgYGB7cn0KbGVuZ3RoKGRpY2tlbnMudikgIyB0aGlzIGZpbmRzIHRoZSBudW1iZXIgb2YgbGluZXMgaW4gdGhlIGJvb2sKCmRpY2tlbnMubG93ZXIudiA8LSB0b2xvd2VyKGRpY2tlbnMudikgIyB0aGlzIG1ha2VzIHRoZSB3aG9sZSB0ZXh0IGxvd2VyY2FzZWQsIGFuZCBlYWNoIHNlbnRlbmNlIGlzIG5vdyBpbiBhIGxpc3QKCmRpY2tlbnMud29yZHMgPC0gc3Ryc3BsaXQoZGlja2Vucy5sb3dlci52LCAiXFxXIikgIyBzdHJzcGxpdCBpcyB2ZXJ5IGltcG9ydGFudDogaXQgdGFrZXMgZWFjaCBzZW50ZW5jZSBpbiB0aGUgbG93ZXJjYXNlZCB3b3JkcyB2ZWN0b3IgYW5kIHB1dHMgZWFjaCB3b3JkIGluIGEgbGlzdCBieSBmaW5kaW5nIG5vbi13b3JkcywgaS5lLiwgd29yZCBi
b3VuZGFyaWVzCiMgZWFjaCBsaXN0IGl0ZW0gKHdvcmQpIGNvcnJlc3BvbmRzIHRvIGFuIGVsZW1lbnQgb2YgdGhlIGJvb2sncyBzZW50ZW5jZXMgdGhhdCBoYXMgYmVlbiBzcGxpdC4gSW4gdGhlIHNpbXBsZXN0IGNhc2UsIHggaXMgYSBzaW5nbGUgY2hhcmFjdGVyIHN0cmluZywgYW5kIHN0cnNwbGl0IG91dHB1dHMgYSBvbmUtaXRlbSBsaXN0LgoKY2xhc3MoZGlja2Vucy53b3JkcykgIyB0aGUgY2xhc3MgZnVuY3Rpb24gdGVsbHMgeW91IHRoZSBkYXRhIHN0cnVjdHVyZSBvZiB5b3VyIHZhcmlhYmxlCgpkaWNrZW5zLndvcmRzLnYgPC0gdW5saXN0KGRpY2tlbnMud29yZHMpCgpjbGFzcyhkaWNrZW5zLndvcmRzLnYpCgpkaWNrZW5zLndvcmRzLnZbMToyMF0gIyBmaW5kIHRoZSBmaXJzdCAyMCB0ZW4gd29yZHMgaW4gR3JlYXQgRXhwZWN0YXRpb25zCmBgYAoKRGlkIHlvdSBub3RpY2UgdGhlICJcXFciIGluIHRoZSBgc3Ryc3BsaXRgIGFyZ3VtZW50PyBXaGF0IGlzIHRoYXQgYWdhaW4/IFJlZ2V4ISBOb3RpY2UgdGhhdCBpbiBSIHlvdSBuZWVkIHRvIHVzZSBhbm90aGVyIGJhY2tzbGFzaCB0byBpbmRpY2F0ZSBhIGNoYXJhY3RlciBlc2NhcGUuCgpBbHNvLCBkaWQgeW91IG5vdGljZSB0aGUgYmxhbmsgcmVzdWx0IG9uIHRoZSAxMHRoIHdvcmQ/IFRoaXMgcmVxdWlyZXMgYSBsaXR0bGUgY2xlYW4tdXAgc3RlcC4KCmBgYHtyfQpub3QuYmxhbmtzLnYgPC0gd2hpY2goZGlja2Vucy53b3Jkcy52IT0iIikKCmRpY2tlbnMud29yZHMudiA8LSBkaWNrZW5zLndvcmRzLnZbbm90LmJsYW5rcy52XQpgYGAKCkV4dHJhIHdoaXRlIHNwYWNlcyBvZnRlbiBjYXVzZSBwcm9ibGVtcyBmb3IgdGV4dCBhbmFseXNpcy4KCmBgYHtyfQpkaWNrZW5zLndvcmRzLnZbMToyMF0KYGBgCgoKVm9pbGEhIFdlIG1pZ2h0IHdhbnQgdG8gZXhhbWluZSBob3cgbWFueSB0aW1lcyB0aGUgdGhpcmQgcmVzdWx0ICJmYXRoZXIiIG9jY3VycyAodGhlIGZvdXJ0aCB3b3JkIHJlc3VsdCwgYW5kIG9uZSB0aGF0IHdpbGwgcHJvYmFibHkgYmUgYW4gaW1wb3J0YW50IHdvcmQgaW4gdGhpcyBib29rKS4KCmBgYHtyfQpsZW5ndGgoZGlja2Vucy53b3Jkcy52W3doaWNoKGRpY2tlbnMud29yZHMudj09ImZhdGhlciIpXSkKYGBgCgpPciBwcm9kdWNlIGEgbGlzdCBvZiBhbGwgdW5pcXVlIHdvcmRzLgoKYGBge3J9CnVuaXF1ZShzb3J0KGRpY2tlbnMud29yZHMudiwgZGVjcmVhc2luZyA9IEZBTFNFKSlbMTo1MF0KYGBgCgpIZXJlIHdlIGZpbmQgYW5vdGhlciBwcm9ibGVtOiB3ZSBmaW5kIGluIG91ciB1bmlxdWUgd29yZCBsaXN0IHNvbWUgb2RkIG5vbi13b3JkcyBzdWNoIGFzICIwMDM3bS4iIFdlIHNob3VsZCBzdHJpcCB0aG9zZSBvdXQuCgojIyMjIEV4ZXJjaXNlIDEKCkNyZWF0ZSBhIHJlZ3VsYXIgZXhwcmVzc2lvbiB0byByZW1vdmUgdGhvc2Ugbm9uLXdvcmRzIGluIGBkaWNrZW5zLndvcmRzLnZgPyBSZW1lbWJlciB0aGF0IHlvdSB1c2UgdHdvIGJhY2tzbGFzaGVzICgvLykg
Zm9yIGNoYXJhY3RlciBlc2NhcGUuIEZvciBtb3JlIGluZm9ybWF0aW9uIG9uIHVzaW5nIHJlZ2V4IGluIFIsIFJTdHVkaW8gaGFzIGEgaGVscGZ1bCBbY2hlYXQgc2hlZXRdKGh0dHBzOi8vd3d3LnJzdHVkaW8uY29tL3dwLWNvbnRlbnQvdXBsb2Fkcy8yMDE2LzA5L1JlZ0V4Q2hlYXRzaGVldC5wZGYpLgoKYGBge3J9CiMgdGhlIHBhdHRlcm4gZm9yIGZpbmRpbmcgdGhvc2UgbnVtZXJpY2FsIGJpdHMgaXMgIlxcZCtbYS16XSoiLCBvciAobW9yZSBwcmVjaXNlbHkpICJcXGR7NH1cXHcqIgojIGJ1dCBiL2MgSSBhbSBub3Qgc3VyZSBpZiBhbGwgb2YgdGhvc2UgaW5zdGFuY2VzIHN0YXJ0IHdpdGggZm91ciBudW1iZXJzIEkgd2lsbCBrZWVwIHRoZSBleHByZXNzaW9uIGZsZXhpYmxlCiMgYWxzbyB0aGUgcmVnZXggYWJvdmUgdGFrZXMgb3V0IG90aGVyIHJhbmRvbSBudW1iZXJzIC0tIGRvIHlvdSB1bmRlcnN0YW5kIHdoeT8gd2hhdCdzIHRoZSBkaWZmZXJlbmNlIGJldHdlZW4gIlxcZCtbYS16XSoiIGFuZCAiXFxkK1thLXpdIgojIG9uZSB3YXkgdG8gaGFuZGxlIHRoaXMgaXMgdG8gdXNlIGdzdWIsIGEgY29tbW9uIHJlZ2V4LXJlbGF0ZWQgZnVuY3Rpb24gaW4gUiAKCmRpY2tlbnMud29yZHMuY2xlYW4udiA8LSBnc3ViKCJcXGQrW2Etel0qIiwgIiIsIGRpY2tlbnMud29yZHMudikgCiMgZ3N1YiBpcyBhIGZ1bmN0aW9uIGZvciBzdHJpcHBpbmcgb3V0IHN0dWZmIGJ5IG1lYW5zIG9mIGEgZ2xvYmFsIHJlZ2V4IHNlYXJjaCBhbmQgcmVwbGFjZTogCiMgYWZ0ZXIgZW50ZXJpbmcgZ3N1Yiwgd2l0aGluIHRoZSBwYXJlbnRoZXNlcyB5b3UgZW50ZXIgdGhlIHJlZ2V4IGluIHF1b3RlcywgdGhlbiBpdHMgcmVwbGFjZW1lbnQgKHdoaWNoIGluIHRoaXMgY2FzZSBpcyBhIGJsYW5rKSwgdGhlbiB0aGUgdmVjdG9yIHRvIHdoaWNoIGl0IGFwcGxpZXMKYGBgCgpOb3cgbGV0J3MgcmUtcnVuIHRoYXQgbm90LmJsYW5rcyB2ZWN0b3IgdG8gc3RyaXAgb3V0IHRoZSBibGFuayB5b3UganVzdCBhZGRlZC4gCgpgYGB7cn0Kbm90LmJsYW5rcy52IDwtIHdoaWNoKGRpY2tlbnMud29yZHMuY2xlYW4udiE9IiIpCgpkaWNrZW5zLndvcmRzLmNsZWFuLnYgPC0gZGlja2Vucy53b3Jkcy5jbGVhbi52W25vdC5ibGFua3Mudl0KCnVuaXF1ZShzb3J0KGRpY2tlbnMud29yZHMuY2xlYW4udiwgZGVjcmVhc2luZyA9IEZBTFNFKSlbMTo1MF0KYGB