A recent post on AEA365, plus my Evaluation Twitter working group, inspired me to finally learn how to scrape tweets in R! The AEA365 post linked to a tutorial on how to get started, which was helpful at first. However, I ran into an issue where only some of my most recent tweets were being scraped, not all of them. I ended up having to pull six waves of data, using the maxID argument to userTimeline each time to work backward through the entire year's worth of tweets. I then combined the dataframes and wrote the result to a CSV for further analysis. I spent some time figuring out how to grab the most frequently used terms (this guide was handy) before I gave up and did everything else in Excel.
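The combine-and-dedupe step can be sketched like this. It's a minimal sketch with made-up tweet data (the `id` and `created` columns stand in for what twitteR's `as.data.frame` returns); the real scraping code is at the end of the post.

```r
library(dplyr)
library(lubridate)

# Two made-up "waves" of scraped tweets; maxID pagination means
# waves can overlap, so they share tweet id "3"
wave1 <- data.frame(id = c("1", "2", "3"),
                    text = c("a", "b", "c"),
                    created = as.POSIXct(c("2017-12-01", "2017-11-15", "2017-10-01")),
                    stringsAsFactors = FALSE)
wave2 <- data.frame(id = c("3", "4", "5"),
                    text = c("c", "d", "e"),
                    created = as.POSIXct(c("2017-10-01", "2016-12-31", "2017-01-05")),
                    stringsAsFactors = FALSE)

# Stack the waves, drop the overlap, and keep only 2017 tweets
tweets_2017 <- bind_rows(wave1, wave2) %>%
  distinct(id, .keep_all = TRUE) %>%
  filter(year(created) == 2017)
```

The `distinct(id, ...)` call is what keeps the overlapping tweet from being counted twice when the waves are stacked.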
Overall, I had 996 tweets in 2017. I had about 1,071 mentions and 6,892 visits, and I was going to tell you how many new followers I have, but Twitter analytics says 484 (with May alone having 258!) and I only have 425, so I have a feeling it's counting people who later unfollowed me. What happened in May?! I have no idea. Maybe some bot service found me and followed/unfollowed me? Who knows.
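For anyone curious, a monthly breakdown like Twitter analytics shows could also be pulled straight from the scraped data. A minimal sketch with made-up timestamps (the real data frame has a `created` column):

```r
library(dplyr)
library(lubridate)

# Made-up creation timestamps standing in for danatweets$created
created <- as.POSIXct(c("2017-05-01", "2017-05-20", "2017-12-23"))

# Count tweets per calendar month (month 5 = May, 12 = December)
monthly <- data.frame(created = created) %>%
  count(month = month(created))
```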
Most Frequently Used Terms
Not surprisingly, my most frequently used term on Twitter was "eval." I think next I'll analyze the #eval hashtag, but I'll save that for another day. Also, I mention a lot of people (or maybe they are replies?), so those people are highlighted in red.
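The red highlighting for mentioned people was done by hand, but one way to automate it would be to keep a vector of handles and flag matching terms. The handles and terms below are hypothetical stand-ins:

```r
# Hypothetical handles of people I mention often (the cleaning step
# strips the "@", so mentions show up as bare usernames in the terms)
mention_names <- c("somehandle", "anotherhandle")

# A few made-up terms standing in for the frequency table
terms <- c("eval", "somehandle", "data", "anotherhandle")
is_mention <- terms %in% mention_names

# In ggplot2 this flag could drive the red highlighting, e.g.:
# ggplot(df, aes(term, freq, fill = is_mention)) +
#   geom_col() +
#   scale_fill_manual(values = c("grey40", "red"))
```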
My Most Liked/Retweeted Tweets
Unfortunately, my most liked and retweeted tweet had absolutely nothing to do with eval, but it was pretty fun and exciting to have it “blow up” like it did.
— Dana Linnell Wanzer (@danawanzer) December 23, 2017
Otherwise, here are some of my other most liked and/or retweeted tweets:
— Dana Linnell Wanzer (@danawanzer) November 8, 2017
My #eval17 presentation on surveying children has been by far my most successful presentation yet. Positive feedback, the room wasn't mostly empty, and multiple people have asked for slides/resources (which you can download here: https://t.co/dLIFnAaSeQ )
— Dana Linnell Wanzer (@danawanzer) December 20, 2017
— Dana Linnell Wanzer (@danawanzer) December 30, 2017
As a researcher and evaluator, I greatly concur. Evaluators know how to work with practitioners, it's their job https://t.co/OylbRGUUT4
— Dana Linnell Wanzer (@danawanzer) July 11, 2017
— Dana Linnell Wanzer (@danawanzer) December 20, 2017
— Dana Linnell Wanzer (@danawanzer) March 13, 2017
Overall, this was a fun little exercise! What else should I analyze?
If you are interested in my messy R code, here it is. The scraping was fairly straightforward, but cleaning up the text was something I had never done before, so the code could probably be tidied.
#Load required packages
library(stringr)
library(twitteR)
library(purrr)
library(tidytext)
library(dplyr)
library(tidyr)
library(lubridate)
library(scales)
library(broom)
library(ggplot2)
library(tm) # for Corpus, tm_map, and TermDocumentMatrix below

#Get access to Twitter.
#Instructions here: http://www.interhacktives.com/2017/01/25/scrape-tweets-r-journalists/
consumerKey <- "INSERT"
consumerSecret <- "INSERT"
accessToken <- "INSERT"
accessSecret <- "INSERT"
options(httr_oauth_cache = TRUE)
setup_twitter_oauth(consumer_key = consumerKey,
                    consumer_secret = consumerSecret,
                    access_token = accessToken,
                    access_secret = accessSecret)

# Scrape tweets. Tweet IDs are too large for R's doubles to store
# exactly, so maxID is passed as a string.
danatweets1 <- userTimeline("danawanzer", n = 3200)
danatweets1_df <- tbl_df(map_df(danatweets1, as.data.frame))
danatweets2 <- userTimeline("danawanzer", n = 3200, maxID = "928590693353869000")
danatweets2_df <- tbl_df(map_df(danatweets2, as.data.frame))
danatweets3 <- userTimeline("danawanzer", n = 3200, maxID = "895516853119758337")
danatweets3_df <- tbl_df(map_df(danatweets3, as.data.frame))
danatweets4 <- userTimeline("danawanzer", n = 3200, maxID = "870360671313043456")
danatweets4_df <- tbl_df(map_df(danatweets4, as.data.frame))
danatweets5 <- userTimeline("danawanzer", n = 3200, maxID = "851081522169864192")
danatweets5_df <- tbl_df(map_df(danatweets5, as.data.frame))
danatweets6 <- userTimeline("danawanzer", n = 3200, maxID = "846499061079289856")
danatweets6_df <- tbl_df(map_df(danatweets6, as.data.frame))
danatweets7 <- userTimeline("danawanzer", n = 3200, maxID = "836019959239094273")
danatweets7_df <- tbl_df(map_df(danatweets7, as.data.frame))

# Combine the 2017 waves and save for later analysis
danatweets <- rbind(danatweets1_df, danatweets2_df, danatweets3_df,
                    danatweets4_df, danatweets5_df, danatweets6_df)
write.csv(danatweets, "danatweets.csv")

# Most common words
myCorpus <- Corpus(VectorSource(danatweets$text))

# Strip URLs, then everything that isn't a letter or a space
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))

# Lowercase everything, then drop English stopwords
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removeWords, stopwords("en"))

tdm <- TermDocumentMatrix(myCorpus,
                          control = list(removePunctuation = TRUE, stopwords = TRUE))
freq.terms <- findFreqTerms(tdm, lowfreq = 20)
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 20)
df <- data.frame(term = names(term.freq), freq = term.freq)
df$term <- factor(df$term, levels = df$term[order(df$freq)])
write.csv(df, "freqterms.csv")

ggplot(df, aes(x = term, y = freq)) +
  geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") +
  coord_flip() +
  theme(axis.text = element_text(size = 7))

#what words are associated with eval
findAssocs(tdm, "eval", .2)
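One footnote on that last line: findAssocs() returns every other term whose per-document counts correlate with "eval" above the given threshold. A toy corpus (the three documents below are made up) makes the behavior concrete:

```r
library(tm)

# Three tiny documents: "methods" appears in exactly the same
# documents as "eval", so their correlation is 1.00
docs <- c("eval methods rock", "eval methods rule", "cats nap here")
tdm_toy <- TermDocumentMatrix(Corpus(VectorSource(docs)))

# Only "methods" clears the 0.9 threshold
findAssocs(tdm_toy, "eval", 0.9)
```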