Mario - blog

Disclaimer: The analysis below is provided with an unbiased view and the opinion expressed below is mine and do not represent the opinions of any entity whatsoever with which I have been, am now, or will be affiliated.

Introduction

It has been a while since my last post but today I will deep dive into Mueller Report using the package tidytext. Also, credits to Aditya Mangal for his initial exploratory analysis article posted in Medium that encouraged me to take it further. I recommend checking his article before reading mine.

Environment Setup

Let’s load all of the libries needed for analysis.

library(pacman)
pacman::p_load(tidyverse,
               tidytext,
               ggplot2,
               scales,
               data.table,
               hunspell,
               wordcloud,
               reshape2,
               topicmodels)

Load the Data (Mueller Report)

Here is the PDF from the Justice Department if anyone is interested in reading it. I will use the pre-converted CSV format of the report from here.

Cleaning the data

I used the same data cleaning approach as Aditya on starting a few pages in due to failure of PDF to text parsing and drop the lines which have the majority of words misspelled using the package hunspell in the data with several modifications.

report <- data.table::fread("https://github.com/gadenbuie/mueller-report/raw/36fbb136a2a508c812db8773e9342b7a55204b20/mueller_report.csv",
                            data.table = FALSE)

content <- report %>%
            filter(page >= 9, !is.na(text)) %>%
            rowwise() %>%
            mutate(num_mispelled_words = length(hunspell(text)[[1]]),
                   num_words = length(str_split(text, " ")[[1]]),
                   perc_misspelled = num_mispelled_words/num_words) %>%
            filter(perc_misspelled <= 0.5) %>%
            select(-num_mispelled_words, -num_words)

Then, I normalize the lines using tidytext. Notice that I replaced a lot of misspelling words manually due to the strikethrough on the top of documents on sentence such as Attorney Work Product // May Contain Material Protected Under Fed. R. Crim. P. 6(e)

content <- content %>% 
  unnest_tokens(text, text, token = "lines") %>%
  mutate(text = stringr::str_replace_all(text, c("uhtierfeti" = "under fed", 
                                                 "uhaef" = "under",
                                                 "uheer" = "under",
                                                 "uncler" = "under",
                                                 "uhder" = "under",
                                                 "pretecteti" = "protected",
                                                 "proteetecl" = "protected",
                                                 "prnteetee" = "protected",
                                                 "pfoettet" = "protected",
                                                 "proteetee" = "protected",
                                                 "proteeted" = "protected",
                                                 "ma:tefittlproteetea" = "material protected",
                                                 "cehtaihmaterial" = "contain material",
                                                 "mttterittl" = "material",
                                                 "prodttet" = "product",
                                                 "preettet" = "product",
                                                 "proauet" = "product",
                                                 "fecl" = "fed",
                                                 "fca" = "fed",
                                                 "cofltttifl" = "contain",
                                                 "cet'ttaih" = "contain",
                                                 "cohtaih" = "contain",
                                                 "amorttcy" = "attorney",
                                                 "atteme" = "attorney",
                                                 "attorhe" = "attorney",
                                                 "weft" = "work",
                                                 "werle" = "work",
                                                 "depart ment" = "department",
                                                 "mtty" = "may"
  )))

Most Popular Words

This is the exact replica of what Aditya did on his analysis since I thought it would be interesting for those who haven’t read his article yet.

tidy_content <- content %>%
                  unnest_tokens(word, text) %>%
                  anti_join(stop_words)

tidy_content %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  filter(!is.na(word)) %>% 
  count(word, sort = TRUE) %>%
  filter(str_length(word) > 1, n > 400) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_segment(aes(x=word, xend=word, y=0, yend=n), color="skyblue", size = 1) +
  geom_point(color="blue", size = 4, alpha = 0.6) +
  coord_flip() +
  theme_bw() +
  theme(panel.grid.minor.y = element_blank(),
        panel.grid.major.y = element_blank(),
        legend.position="none") +
  labs(x = "",
       y = "Number of Occurences",
       title = "Most popular words from the Mueller Report",
       subtitle = "Words Occuring more than 400 times",
       caption = "Based on data from the Mueller Report")

For those who have been following the news, the most popular words shown from the plot above such as trump, campaign, russian, cohen, russia are not suprising at all.

Sentiment Analysis

One might expect that the sentiment of the document is mostly negative but let’s use the three general-purpose lexicons for comparison: AFINN, bing, and nrc.

The nrc lexicon categorizes words in a binary setting into categories of positive, negative, anger, anticipation, etc. The bing lexicon categorizes into either positive or negative categories. The AFINN lexicon assigns words with a score between -5 and 5 indicating most negative to most positive sentiments.

The code below compares all of the 3 lexicons split by the index counts up of 20 pages of text since the individual page might not have enough words in them due to the words on the document are mostly redacted. Now, we plot the sentiment scores across the index and type of lexicon methods used to observe its trajectory.

We see similar dips and peaks in the sentiment of each lexicon methods, however, we can see subtle differences where NRC sentiment is high, the Bing et al. sentiment appears to find longer stretches of similar text, and the AFINN sentiment has more variance.

afinn <- tidy_content %>%
          inner_join(get_sentiments("afinn")) %>%
          group_by(index = page %/% 20) %>%
          summarise(sentiment = sum(score)) %>%
          mutate(method = "AFINN")

bing_and_nrc <- bind_rows(tidy_content %>%
                            inner_join(get_sentiments("bing")) %>%
                            mutate(method = "Bing et al."),
                          tidy_content %>%
                            inner_join(get_sentiments("nrc") %>%
                                        filter(sentiment %in% c("positive", "negative"))) %>%
                            mutate(method = "NRC")) %>%
                count(method, index = page %/% 20, sentiment) %>%
                spread(sentiment, n, fill = 0) %>%
                mutate(sentiment = positive - negative)

bind_rows(afinn,
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill=method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  theme_bw()

Most common positive and negative words

Let’s use the Bing et al. lexicon to get the most common positive and negative words on the document. From the plot below, we see several interesting negative words such as obstruction and interference and positive words such as pardon and loyalty.

bing_word_counts <- tidy_content %>%
                      inner_join(get_sentiments("bing")) %>%
                      count(word, sentiment, sort = TRUE) %>%
                      ungroup()

bing_word_counts %>% 
  group_by(sentiment) %>%
  top_n(15) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip() +
  theme_bw()

Wordclouds

What about the good ol’ word clouds? Let’s visualize word that has either positive or negative sentiments using the package wordcloud which aligned with what we have when using Bing et al. approach shown above.

tidy_content %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#D95F02", "#1B9E77"),
                   max.words = 100)

Which page of the document has the highest negative words?

I don’t know about you but I am most interested in a page that contains the highest negative words ratio (negative words/total # words on the page). Again, I utilized bing lexicon to get words that contain negative sentiment and filter to the page that has more than 100 words on it.

Here is what I found by observing the top 5 page that has the highest negative words ratio on the actual PDF:

Page 199 (page 191 of Volume I of II on the actual report): Half of the page is redacted but there are a lot of negative words such as false, fictitious, fraudulent, etc
Page 220 (page 2 of Volume II of II on the actual report): Looks to be an introduction of Volume II of II with interesting read
Page 61 (page 53 of Volume I of II on the actual report): 3/4 of the page is redacted and mostly talked about Cohen and Manafort
Page 60 (page 52 of Volume I of II on the actual report): Most of the page is redacted but has some mention of Assange from WikiLeaks
Page 369 (page 157 of Volume II of II on the actual report): Contains a lot of negative words such as obstruction, criminal, discouragement, etc

bingnegative <- get_sentiments("bing") %>%
                  filter(sentiment == "negative")

wordcounts <- tidy_content %>%
              group_by(page) %>%
              summarise(words = n())

tidy_content %>%
  semi_join(bingnegative) %>%
  group_by(page) %>%
  summarise(negativewords = n()) %>%
  left_join(wordcounts, by = c("page")) %>%
  mutate(ratio = negativewords/words) %>%
  ungroup() %>%
  filter(words >= 100) %>%
  arrange(desc(ratio)) %>%
  top_n(5)

## # A tibble: 5 x 4
##    page negativewords words ratio
##   <int>         <int> <int> <dbl>
## 1   199            24   193 0.124
## 2   220            34   288 0.118
## 3    61            13   119 0.109
## 4    60            23   212 0.108
## 5   369            35   327 0.107

TFIDF (Term Frequency Inverse Document Frequency)

Without going into too much detail, the definition of tf-idf provided by Wikipedia is great. tfidf is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.

tidy_content %>%
  count(word, page,  sort = TRUE) %>%
  bind_tf_idf(word, page, n) %>%
  arrange(desc(tf_idf)) %>%
  top_n(15) %>%
  mutate(word_page = paste(word, page, sep = "_")) %>%
  mutate(word_page = reorder(word_page, tf_idf)) %>%
  ggplot(aes(word_page, tf_idf)) +
  geom_col(show.legend = FALSE) +
  labs(x = "", y = "tf-idf of term and page#") +
  coord_flip() +
  theme_bw()

The plot above shows the top 15 terms by page number (i.e. ongoing_65 indicated the term ongoing on page 65) with its tf-idf. It seems that the term harm, ongoing, matter, and protected have relatively high tf-idf.

Bigram

Notice that all of the analysis that we looked at so far contains one word such as harm, trump, etc. While one word is very simple to understand, it doesn’t give us a lot of contexts on what the word stands for. With that in mind, let’s try to look at two consecutive words which are known as a bigram.

bigram_tf_idf <- content %>%
                  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
                  separate(bigram, c("word1", "word2"), sep = " ") %>%
                  filter(!word1 %in% stop_words$word) %>%
                  filter(!word2 %in% stop_words$word) %>%
                  unite(bigram, word1, word2, sep = " ") %>%
                  count(page, bigram) %>%
                  bind_tf_idf(bigram, page, n) %>%
                  arrange(desc(tf_idf))

bigram_tf_idf %>%
  top_n(15) %>%
  mutate(bigram_page = paste(bigram, page, sep = "_")) %>%
  mutate(bigram_page = reorder(bigram_page, tf_idf)) %>%
  mutate(bigram_page = reorder(bigram_page, tf_idf)) %>%
ggplot(aes(x = bigram_page, y = tf_idf)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf of bigram by page") +
  coord_flip() +
  theme_bw()

The interesting result of the above plot is that most of the redacted pages have ongoing matter on top of the black rectangles. While prosecutorial judgment seems to appear a lot in the introduction and conclusion section of the actual PDF report.

Topic modeling

Although the Mueller report is considered as one topic, let’s try to see if we can find terms that we can separate into 2 or more topics. For the sake of simplicity, we will choose 2 topics. Topic modeling is the unsupervised classification of documents (think of it as clustering on numeric data) that finds natural groups of the terms in our case.

One of the most popular methods for fitting a topic model is Latent Dirichlet Allocation (LDA). It treats each document as a mixture of topics, and each topic as a mixture of words. In our case, a page will be considered as a document.

In order to use the LDA function provided in the package topicmodels, we have to convert the dataframe to a Document Term Matrix (DTM) format as shown below.

dtm_content <- tidy_content %>%
  count(word, page, sort = TRUE) %>%
  rename(count = n) %>% 
  cast_dtm(page, word, count)

Next, we run LDA in our newly converted Document Term Matrix and specified k = 2 topics. Notice that we used the tidy() function to extract the per-topic-per-word probabilities, called “beta”. Then, the plot shows the term by beta value based on 2 topics. The terms appeared on the first topic seem to be the center of the Mueller Report: president, trump, campaign, manafort, cohen, 2016, justice, russian, and office. If you are up to date with news, that shouldn’t be unfamiliar to you.

The terms on the second topic seem to indicate president, trump, and series of number. 302 appeared a lot on the footnote while I couldn’t find much information on the other numbers from the good old (ctrl/cmd + f) on a PDF document. Do let me know if you happen to find any interesting things from those numbers.

lda_model <- LDA(dtm_content, k = 2, control = list(seed = 1234))

lda_topics <- tidy(lda_model, matrix = "beta")

lda_top_terms <- lda_topics %>%
                  group_by(topic) %>%
                  top_n(10, beta) %>%
                  ungroup() %>%
                  arrange(topic, -beta)

lda_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  theme_bw()

Let’s try a different approach by finding the terms that has the greatest difference in beta between topic 1 and topic 2 by utilizing the log ratio of the two. Why log two you asked? The difference is symmetrical. The log ratio of 1 is equivalent to the beta of topic 2 is twice as large while the log ratio of -1 is equivalent to the beta of topic 1 is twice as large.

beta_spread <- lda_topics %>%
                mutate(topic = paste0("topic", topic)) %>%
                spread(topic, beta) %>%
                filter(topic1 > .001 | topic2 > .001) %>%
                mutate(log_ratio = log2(topic2/topic1)) %>%
                mutate(direction_ind = ifelse(log_ratio > 0, 1, 0))

rbind(beta_spread %>%
        filter(direction_ind == 1) %>%
        arrange(desc(log_ratio)) %>%
        head(15),
      beta_spread %>%
        filter(direction_ind == 0) %>%
        arrange(log_ratio) %>%
        head(15)) %>% 
     mutate(term = reorder(term, log_ratio)) %>%
ggplot(aes(x = term, y = log_ratio)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  theme_bw() +
  labs(y = "Log2 ratio of beta in topic 2 / topic 1")

Now, let’s compare the terms on these 2 topics. We can see that the topic 2 has a lot of terms related to Russia and Trump’s administration official such as Kirill Dmitriev, Rick Gerson, Erik Prince, Stephen Miller, Reince Priebus, Steve Bannon, etc. While the topic 1 contains terms such as statutes, jury, intent, court, constitutional, corrupt, etc. This approach actually yields quite an interesting result that I didn’t expect to find.

Conclusion and Learning

I hope you find the analysis to be interesting and feel free to reach out with any question/suggestion. I have provided the link to the entire code on the resources section. Thanks for reading!

Analysis on Mueller Report