class: center, middle, inverse, title-slide

# DSA Executive Week 19
## Collecting and analyzing Twitter data
### Michael W. Kearney 📊
School of Journalism  
Informatics Institute  
University of Missouri

### @kearneymw
### @mkearney
---
background-image: url(img/logo.png)
background-size: 350px auto
background-position: 50% 20%
class: center, bottom

View these slides at [dsa19talk.mikewk.com](https://dsa19talk.mikewk.com)

View the companion script at [dsa19talk.mikewk.com/script.Rmd](https://dsa19talk.mikewk.com/script.Rmd)

View the Github repo at [github.com/mkearney/dsa19talk](https://github.com/mkearney/dsa19talk)

---

# Overview

1. Crash course in {rtweet}/Twitter data
   + About the package
   + Major functions/examples
1. Case study in Twitter bot detection
   + tweetbotornot

---
class: tight

## About {rtweet}

- On the Comprehensive R Archive Network (CRAN) [](https://opensource.org/licenses/MIT)[](https://cran.r-project.org/package=rtweet)
- Growing base of users [](http://depsy.org/package/r/rtweet)
- Fairly stable [](https://travis-ci.org/mkearney/rtweet)[](https://codecov.io/gh/mkearney/rtweet?branch=master)[](https://www.tidyverse.org/lifecycle/#maturing)
- Package website: [rtweet.info](http://rtweet.info) [](http://rtweet.info/)
- Github repo: [mkearney/rtweet](https://github.com/mkearney/rtweet) [](https://github.com/mkearney/rtweet/)[](https://github.com/mkearney/rtweet/)

---

## Crash course

1. Search tweets
1. Timelines
1. Favorites
1. Lookup tweets
1. Friends/followers
1. Lookup users
1. Lists
1. Stream

---

## Install

- Install **{rtweet}** from [CRAN](https://cran.r-project.org/package=rtweet).

```r
## install from CRAN
install.packages("rtweet")
```

- Or install the **development version** from [Github](https://github.com/mkearney/rtweet).

```r
## install the development version
devtools::install_github("mkearney/rtweet")
```

- Load **{rtweet}**

```r
library(rtweet)
```

---

## httpuv

To authorize rtweet's embedded **rstats2twitter** app via web browser, the **{httpuv}** package is required.

```r
## install httpuv for browser-based authentication
install.packages("httpuv")
```

---
class: inverse, center, middle

# Accessing web APIs

---

## Some background

**Application Program Interfaces** (APIs) are sets of protocols that govern interactions between sites and users.

+ APIs are similar to web browsers but serve a different purpose:
   - Web browsers render **visual content**
   - Web APIs manage and organize **data**
+ For public APIs, many sites only allow **authorized** users
   - Twitter, Facebook, Instagram, Github, etc.

---

## developer.twitter.com

To create your own token (with write and DM-read access), users must...

1. Apply and get approved for a developer account with Twitter
1. Create a Twitter app (fill out a form)

- For step-by-step instructions on how to create a Twitter app and corresponding token, see [rtweet.info/articles/auth.html](https://rtweet.info/articles/auth.html)

---

# rstats2twitter

The beauty of {rtweet} is that all you need is a Twitter username and password (see the sketch below)

+ A user token is generated using the embedded **rstats2twitter** app
+ The path to the token is saved as an environment variable
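A minimal sketch of that first-run flow, assuming a recent {rtweet} release: the browser-based authorization is triggered by your first API call, so no keys need to be copied by hand.

```r
## the first rtweet call opens a browser window asking you to authorize
## the rstats2twitter app; once approved, the token is created and
## cached so future sessions can reuse it
library(rtweet)
mizzou <- search_tweets("mizzou", n = 10)

## inspect the token rtweet is currently using
get_token()
```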
---
class: inverse, center, middle

# Twitter Data!

---
class: inverse, center, middle

# 1. <br /> Searching for tweets

---

## `search_tweets()`

Search for one or more keyword(s)

```r
## basic keyword search
rds <- search_tweets("rstats data science")
rds
```

<br>

> *Note*: implicit `AND` between words

---

## `search_tweets()`

Search for an exact phrase

```r
## single quotes around doubles
ds <- search_tweets('"data science"')

## or escape the quotes
ds <- search_tweets("\"data science\"")
ds
```

---

## `search_tweets()`

Search for keyword(s) **and** a phrase

```r
## keyword and exact phrase search
rpds <- search_tweets("rstats python \"data science\"")
rpds
```

---

## `search_tweets()`

+ `search_tweets()` returns the 100 most recent matching tweets by default
+ Increase `n` to return more (tip: use intervals of 100)

```r
## increase desired number of tweets to return
rstats <- search_tweets("rstats", n = 10000)
rstats
```

<br>

> Rate limit of 18,000 tweets per fifteen minutes

---

## `search_tweets()`

**PRO TIP #1**: Get the firehose for free by searching for tweets from verified **or** non-verified accounts

```r
## firehose hack
fff <- search_tweets(
  "filter:verified OR -filter:verified", n = 18000)
fff
```

Visualize second-by-second frequency

```r
## time series
ts_plot(fff, "secs")
```

---

<p style="text-align:center"> <img src="img/fff.png" /> </p>

---

## `search_tweets()`

**PRO TIP #2**: Use search operators provided by Twitter, e.g.,

+ filter by language and exclude retweets and replies

```r
## search operators
rt <- search_tweets("rstats", lang = "en",
  include_rts = FALSE, `-filter` = "replies")
```

+ filter only tweets linking to news articles

```r
## journalism filter
nws <- search_tweets("filter:news")
```

---

## `search_tweets()`

+ filter only tweets that contain links

```r
## URL filter
links <- search_tweets("filter:links")
links
```

+ filter only tweets that contain video

```r
## video filter
vids <- search_tweets("filter:video")
vids
```

---

## `search_tweets()`

+ filter only tweets sent from (`from:{screen_name}`) or to (`to:{screen_name}`) certain users

```r
## vector of screen names
users <- c("cnnbrk", "AP", "nytimes", "foxnews",
  "msnbc", "seanhannity", "maddow")
paste0("from:", users, collapse = " OR ")
#> "from:cnnbrk OR from:AP OR from:nytimes OR from:foxnews OR
#>  from:msnbc OR from:seanhannity OR from:maddow"

## search for tweets from any of these users
fromusers <- search_tweets(paste0("from:", users, collapse = " OR "))
fromusers
```

---

## `search_tweets()`

+ filter only tweets with at least 100 favorites or 100 retweets

```r
## filter by faves/retweets
pop <- search_tweets(
  paste("(filter:verified OR -filter:verified)",
    "(min_faves:100 OR min_retweets:100)"))
```

+ filter by the type of device that posted the tweet

```r
## filter by device
rt <- search_tweets("lang:en", source = '"Twitter for iPhone"')
```

---

## `search_tweets()`

**PRO TIP #3**: Search by geolocation (e.g., tweets within 25 miles of Columbia, MO)

```r
## filter by geolocation
como <- search_tweets(
  geocode = "38.9517,-92.3341,25mi", n = 10000
)
como
```
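---

## `rate_limit()`

Several of the tips above bump against the 18,000-per-fifteen-minutes search quota, so it helps to check how much of the allowance remains before a large pull. A minimal sketch (`"search/tweets"` is the standard API endpoint name):

```r
## check remaining search quota for the current token
rate_limit(query = "search/tweets")
```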
---

## `search_tweets()`

Use `lat_lng()` to convert geographical data into `lat` and `lng` variables.

```r
## return/maximize and plot lat/long coord data
como <- lat_lng(como)

## generate US state plot (and zoom in on Missouri)
maps::map("state", fill = TRUE, col = "#ffffff",
  lwd = .25, mar = c(0, 0, 0, 0),
  xlim = c(-96, -89), ylim = c(35, 41))

## plot points
with(como, points(lng, lat, pch = 20, col = "red"))
```

> This code plots geotagged tweets on a map of Missouri

---

<p style="text-align:center"> <img src="img/como.png" /> </p>

---

## `search_tweets()`

**PRO TIP #4**: (for developer accounts only) Use `bearer_token()` to increase the rate limit to 45,000 per fifteen minutes.

```r
## use bearer token for better rate limits
mosen <- search_tweets(
  "beto OR trump OR gillibrand", n = 45000,
  token = bearer_token()
)
```

---
class: inverse, center, middle

# 2. <br /> User timelines

---

## `get_timeline()`

Get the most recent tweets posted by a user.

```r
## get tweets posted to user timeline
cnn <- get_timeline("cnn")

## view time series of tweet frequency
ts_plot(cnn)
```

---

## `get_timeline()`

Get up to the most recent 3,200 tweets (endpoint max) posted by multiple users.

```r
## get multiple timelines
nws <- get_timeline(c("cnn", "foxnews", "msnbc"), n = 3200)
```

---

## `ts_plot()`

Group by `screen_name` and plot hourly frequencies of tweets.

```r
## plot multiple time series
nws %>%
  dplyr::group_by(screen_name) %>%
  ts_plot("hours")
```

---
class: inverse, center, middle

# 3. <br /> User favorites

---

## `get_favorites()`

Get up to the most recent 3,000 tweets favorited by a user.

```r
## get tweets favorited by user
kmw_favs <- get_favorites("kearneymw", n = 3000)
```

---
class: inverse, center, middle

# 4. <br /> Lookup statuses

---

## `lookup_tweets()`

```r
## vector of status IDs
status_ids <- c("947235015343202304", "947592785519173637",
  "948359545767841792", "832945737625387008")

## lookup status (tweet) data
twt <- lookup_tweets(status_ids)
```

---
class: inverse, center, middle

# 5. <br /> Getting friends/followers

---

## Friends/followers

Twitter's API documentation distinguishes between **friends** and **followers**.

+ **Friend** refers to an account a given user follows
+ **Follower** refers to an account following a given user

---

## `get_friends()`

Get the user IDs of accounts **followed by** (AKA the friends of) [@jack](https://twitter.com/jack), the co-founder and CEO of Twitter.

```r
## accounts followed by a user
fds <- get_friends("jack")
fds
```

---

## `get_friends()`

Get the friends of **multiple** users in a single call.

```r
## accounts followed by users
fds <- get_friends(
  c("hadleywickham", "NateSilver538", "Nate_Cohn")
)
fds
```

---

## `get_followers()`

Get the user IDs of accounts **following** (AKA the followers of) [@mizzou](https://twitter.com/mizzou).

```r
## accounts following user
mu <- get_followers("mizzou")
mu
```

---

## `get_followers()`

Unlike friends (limited by Twitter to 5,000 per request), there is **no limit** on the number of followers.

To get the user IDs of all 55(ish) million followers of @realDonaldTrump, you need two things:

1. A stable **internet** connection
1. **Time** (approximately five and a half days)

---

## `get_followers()`

Get all of Donald Trump's followers.

```r
## get all of trump's followers
rdt <- get_followers(
  "realdonaldtrump", n = 56000000,
  retryonratelimit = TRUE
)
```
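The ID vectors returned by `get_friends()` and `get_followers()` pair naturally with `lookup_users()` (covered next). A minimal sketch, assuming the `mu` object from the `get_followers("mizzou")` example above:

```r
## hydrate follower IDs into full user-level data
mu_users <- lookup_users(mu$user_id)
mu_users
```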
---
class: inverse, center, middle

# 6. <br /> Lookup users

---

## `lookup_users()`

Look up user-level data (and each user's most recent tweet) associated with a vector of `user_id` or `screen_name` values.

```r
## vector of users
users <- c("hadleywickham", "NateSilver538", "Nate_Cohn")

## lookup users twitter data
usr <- lookup_users(users)
usr
```

---

## `search_users()`

It's also possible to search for users. Twitter will look for matches in user names, screen names, and profile bios.

```r
## search for breaking news accounts
bkn <- search_users("breaking news")
bkn
```

---
class: inverse, center, middle

# 7. <br /> Lists

---

## `lists_memberships()`

+ Get an account's list memberships (lists that include an account)

```r
## lists that include Nate Silver
nsl <- lists_memberships("NateSilver538")
nsl
```

---

## `lists_members()`

+ Get all list members (accounts on a list)

```r
## all members of congress
cng <- lists_members(owner_user = "cspan", slug = "members-of-congress")

## all members of the cabinet
cab <- lists_members(owner_user = "cspan", slug = "the-cabinet")
```

> This actually led to an interesting piece of data science...

---

<br>

<p style="text-align:center"> <img src="img/oped-article.png" /> </p>

---

<p style="text-align:center"> <img width="70%" src="img/oped-plot.png" /> </p>

---
class: inverse, center, middle

# 8. <br /> Streaming tweets

---

## `stream_tweets()`

**Sampling**: small random sample (`~ 1%`) of all publicly available tweets

```r
## random (1%) stream sample
ss <- stream_tweets("")
```

**Filtering**: search-like query (up to 400 keywords)

```r
## stream by keyword
sf <- stream_tweets("mueller,fbi,investigation,trump,realdonaldtrump")
```

---

## `stream_tweets()`

**Tracking**: vector of user IDs (up to 5,000 user_ids)

```r
## user IDs from congress members (lists_members() example output)
st <- stream_tweets(cng$user_id)
```

**Location**: geographical coordinates (1-360 degree location boxes)

```r
## world-wide bounding box
sl <- stream_tweets(c(-180, -90, 180, 90))
```

---

## `stream_tweets()`

The default stream duration is thirty seconds (`timeout = 30`)

+ Specify a different duration (in seconds)

```r
## stream for 10 minutes
stm <- stream_tweets(timeout = 60 * 10)
```

---

## `stream_tweets()`

Stream JSON data directly to a text file

```r
## stream text to file
stream_tweets(timeout = 60 * 10,
  file_name = "random-stream-2018-11-13.json",
  parse = FALSE)
```

Read in a streamed JSON file

```r
## parse JSON twitter data
rj <- parse_stream("random-stream-2018-11-13.json")
```
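Parsed stream (and search) objects also carry user-level data for the tweet authors. A minimal sketch, assuming the `rj` object from the `parse_stream()` example above:

```r
## extract user-level data for the authors of the parsed tweets
rj_users <- users_data(rj)
rj_users
```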
---

## `stream_tweets()`

Stream tweets indefinitely.

```r
## endless stream
stream_tweets(timeout = Inf,
  file_name = "random-stream-2018-11-13.json",
  parse = FALSE)
```

---

## `lookup_coords()`

A useful convenience function for quickly looking up coordinates (though it now requires a Google Maps API key)

```r
## stream tweets sent from london
luk1 <- stream_tweets(q = lookup_coords("London, UK"), timeout = 60)

## search tweets sent from london
luk2 <- search_tweets(
  geocode = lookup_coords("London, UK"), n = 1000)
```

---
class: inverse, center, middle

# Analyzing Twitter data

---

## Data set

For these examples, let's gather a data set of iPhone and Android users

```r
## search for tweets sent from android and iphone
iphone_android <- search_tweets(
  paste('(filter:verified OR -filter:verified)',
    '(source:"Twitter for iPhone" OR source:"Twitter for Android")'),
  include_rts = FALSE, n = 18000
)

## view breakdown of tweet source (device)
table(iphone_android$source)
```

---

## Text processing

Tokenize tweets into words

```r
## tokenize each tweet into a vector of words
wds <- tokenizers::tokenize_tweets(iphone_android$text)

## collapse back into strings
txt <- purrr::map_chr(wds, paste, collapse = " ")

## get sentiment of the tokenized text using the afinn dictionary
iphone_android$sent <- syuzhet::get_sentiment(
  txt, method = "afinn"
)
```

---

## Compare groups

Group by source and summarize some numeric variables

```r
## group by device and summarise
iphone_android %>%
  dplyr::group_by(source) %>%
  dplyr::summarise(
    sent = mean(sent, na.rm = TRUE),
    avg_rt = mean(retweet_count, na.rm = TRUE),
    avg_fav = mean(favorite_count, na.rm = TRUE),
    tweets = mean(statuses_count, na.rm = TRUE),
    friends = mean(friends_count, na.rm = TRUE),
    followers = mean(followers_count, na.rm = TRUE),
    ff_rat = (friends + 1) / (friends + followers + 1)
  )
```

---

## Features

Easily automate feature extraction for Twitter data.

```r
## install package
remotes::install_github("mkearney/textfeatures")

## feature extraction
tf <- textfeatures::textfeatures(iphone_android)

## add dependent variable
tf$y <- iphone_android$source == "Twitter for iPhone"
```

---

## Machine learning

Run a boosted model

```r
## use {gbm} to estimate model
m1 <- gbm::gbm(y ~ ., data = tf[1:15000, -1], n.trees = 200)
# summary(m1)

## generate predictions
p <- predict(m1, newdata = tf[15001:nrow(tf), -1],
  type = "response", n.trees = 200)

## how'd we do?
table(p > .50, tf$y[15001:nrow(tf)])
```

---

## {xgboost}

```r
## (x is a numeric model matrix built from the features; y is the 0/1 label)
## randomly sample 80% of rows in x (model matrix)
train_rows <- sample(seq_len(nrow(x)), ceiling(nrow(x) * .8))

## select rows not sampled (not in train_rows)
test_rows <- tfse::nin(seq_len(nrow(x)), train_rows)

## set eta (learn rate) and nrounds (number of trees)
m1 <- xgboost::xgboost(
  x[train_rows, ], label = y[train_rows],
  eta = .01, nrounds = 1200)
```

---

## {xgboost} cont'd

```r
## view predictor influence
xgboost::xgb.importance(model = xgboost::xgb.Booster.complete(m1))

## get predictions for test data
pred <- predict(m1, newdata = x[test_rows, ])
```
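As with the {gbm} example, it's worth a quick check of performance on the held-out rows. A minimal sketch, assuming the `pred`, `y`, and `test_rows` objects from the code above and a 0/1 (or logical) label:

```r
## confusion table and accuracy on the held-out rows
table(pred > .50, y[test_rows])
mean((pred > .50) == y[test_rows])
```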
---
class: inverse, center, middle

# Case study

---

# Problem

- Use of bots on social media to manipulate public opinion, amplify social divisions, and spread "fake news"
- **How do we accurately and dynamically classify bots on social media?**

---

# 2016 election

Concerns about **automated accounts, or bots, on social media** manipulating public opinion reached a fever pitch during the 2016 general election.

The most alarming concerns involved **Kremlin-linked bots**:

+ pushing **fake news** stories
+ amplifying **social divisions**
+ misrepresenting **public opinion**

---

# Research question(s)

## How do we **identify bots on social media**?

## How can we do it in **real time** so as to filter out inauthentic social media traffic?

---
class: inverse, center, middle

# Social media data

---

# Social media apps

There's been a lot of talk about Cambridge Analytica's abuse of social media data. **Reactions** can be summarized as follows:

1. **Facebook was negligent** because it inadequately regulated third-party apps
1. **Cambridge Analytica was unethical** because it stole and used data it shouldn't have

> In other words...people don't know exactly what FB and CA actually did, but they know it was wrong.

---

# Social media APIs

**Application Program Interfaces** (APIs) are sets of protocols that govern interactions between sites and users

Similar to web browsers but with a different primary objective:

+ Web browsers **render content**
+ Web APIs **organize data**

For public APIs, many sites only allow access to **authorized** users

+ Twitter, Facebook, Instagram, Github, etc.

---

# Twitter vs. Facebook

For a case study about political bots on social media, Twitter is better because...

1. It's more open about its **data**
   + Unlike Facebook, which actually clamped down on data sharing (with developers) starting in 2014, **Twitter is quite open with its data**
1. It's more **public facing**
   + The **default privacy setting** is public
   + User **connections** are asymmetrical

---
class: inverse, center, middle

# Identifying bots

---

# Current approach

The creators of [**botometer**](https://botometer.iuni.iu.edu/#!/) maintain a list of open-sourced academic studies that identified bots. But this approach is not without **limitations**:

+ Academic research moves slowly (especially with Twitter banning many of the bots)
+ A relatively small number of bots have been collected
+ Labor-intensive methods (human coding)

---

# New approach

One **solution** to these limitations is to combine the academic sources with an easy-to-automate method that takes advantage of naturally occurring human coding

1. Select a handful of previously **verified** bots
1. Look up the public **lists** that include those bots
1. Identify list **names** that self-identify as bot lists
   - Perform validity checks on the other listed accounts

---
class: inverse, center, middle

# Compiling data sets

---

# User/tweet data

Using the list of screen names for non-bot (based on previous academic research and human coding) and bot accounts, I gathered...

+ **tweet-level data** for the most recent 100 tweets posted by each user via `rtweet::get_timelines()`
+ **user-level (or account-level) data** for each user via `rtweet::lookup_users()`

The data returned by Twitter consist of over 80 features related to the tweet and the author (user) of each tweet.

---

# Data sets

Randomly sample the Twitter data into train and test data sets

+ **Train data**: approx. 70% of users in the sample
  - ensuring roughly equal numbers of bots and non-bots
+ **Test data**: the remaining (30%) bot and non-bot accounts

---

# Features

In addition to the numeric variables already returned by Twitter, I used **{textfeatures}** to extract up to 26 features for each textual variable (see the sketch below). The text features were extracted for:

+ Tweets
+ Name
+ Screen name
+ Location
+ Description (bio)
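A minimal, hypothetical sketch of the kind of call involved (not the actual modeling pipeline); `usr` stands in for a users data frame returned by `rtweet::lookup_users()`:

```r
## extract text features from the profile description (bio) field
desc_features <- textfeatures::textfeatures(usr$description)
```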
---
class: inverse, center, middle

# Modeling

---

# Two models

Two models were created.

1. The **default model**, which uses both users and tweets data
   - Rate limited by Twitter to 180 users every 15 minutes
1. The **fast model**, which uses only users data
   - Rate limited by Twitter to 90,000 users every 15 minutes

---

# Model results

A gradient boosted logistic model (via `gbm::gbm()`) was trained and tested on the two data sets for both the default and fast models.

+ The **default model** was 93.53% accurate when classifying bots and 95.32% accurate when classifying non-bots
+ The **fast model** was 91.78% accurate when classifying bots and 92.61% accurate when classifying non-bots

Overall...

+ The **default model** was correct 93.8% of the time
+ The **fast model** was correct 91.9% of the time

---
class: inverse, center, middle

# tweetbotornot

---

# Applications

+ This has since been assembled into an R package, [**{tweetbotornot}**](https://github.com/mkearney/tweetbotornot) (coming soon to CRAN!)
+ It has also been exported as a [Shiny web app](https://mikewk.shinyapps.io/botornot)

---

## Tweetbotornot

A package designed to estimate the probability of an account being a bot.

```r
## install from Github
remotes::install_github("mkearney/tweetbotornot")

## estimate some accounts
bp <- tweetbotornot::tweetbotornot(c(
  "kearneymw", "realdonaldtrump", "netflix_bot",
  "tidyversetweets", "thebotlebowski")
)
bp
```

---
class: inverse, center, middle

# Future directions

---

# Future research

Between the proliferation of news stories about Kremlin-linked bots on social media and the number of false positives reported to me about **tweetbotornot**, I'm starting to think...

The real challenge is not **identifying** social media bots but instead **defining** what it means to be a *bot* on social media

+ Future research should examine how and why people define certain accounts as bots and others as not bots

---
class: inverse, center, middle

# The end