class: center, middle, inverse, title-slide

# DSA Executive Week 19
## Collecting and analyzing Twitter data
### Michael W. Kearney 📊
School of Journalism  
Informatics Institute  
University of Missouri

### @kearneymw
### @mkearney
---
background-image: url(img/logo.png)
background-size: 350px auto
background-position: 50% 20%
class: center, bottom

View these slides at [dsa19talk.mikewk.com](https://dsa19talk.mikewk.com)

View the companion script at [dsa19talk.mikewk.com/script.Rmd](https://dsa19talk.mikewk.com/script.Rmd)

View the Github repo at [github.com/mkearney/dsa19talk](https://github.com/mkearney/dsa19talk)

---

# Overview

1. Crash course in {rtweet}/Twitter data
   + About the package
   + Major functions/examples
1. Case study in Twitter bot detection
   + tweetbotornot

---
class: tight

## About {rtweet}

- On the Comprehensive R Archive Network (CRAN) [](https://opensource.org/licenses/MIT)[](https://cran.r-project.org/package=rtweet)
- Growing base of users [](http://depsy.org/package/r/rtweet)
- Fairly stable [](https://travis-ci.org/mkearney/rtweet)[](https://codecov.io/gh/mkearney/rtweet?branch=master)[](https://www.tidyverse.org/lifecycle/#maturing)
- Package website: [rtweet.info](http://rtweet.info) [](http://rtweet.info/)
- Github repo: [mkearney/rtweet](https://github.com/mkearney/rtweet) [](https://github.com/mkearney/rtweet/)[](https://github.com/mkearney/rtweet/)

---

## Crash course

1. Search tweets
1. Timelines
1. Favorites
1. Lookup tweets
1. Friends/followers
1. Lookup users
1. Lists
1. Stream

---

## Install

- Install **{rtweet}** from [CRAN](https://cran.r-project.org/package=rtweet).

```r
## install from CRAN
install.packages("rtweet")
```

- Or install the **development version** from [Github](https://github.com/mkearney/rtweet).

```r
## install the development version
devtools::install_github("mkearney/rtweet")
```

- Load **{rtweet}**

```r
library(rtweet)
```

---

## httpuv

To authorize rtweet's embedded **rstats2twitter** app via web browser, the **{httpuv}** package is required.

```r
## install httpuv for browser-based authentication
install.packages("httpuv")
```

---
class: inverse, center, middle

# Accessing web APIs

---

## Some background

**Application Program Interfaces** (APIs) are sets of protocols that govern interactions between sites and users.

+ APIs are similar to web browsers but serve a different purpose:
   - Web browsers render **visual content**
   - Web APIs manage and organize **data**
+ For public APIs, many sites only allow **authorized** users
   - Twitter, Facebook, Instagram, Github, etc.

---

## developer.twitter.com

To create your own token (with write and DM-read access), users must...

1. Apply and get approved for a developer account with Twitter
1. Create a Twitter app (fill out a form)

- For step-by-step instructions on how to create a Twitter app and corresponding token, see [rtweet.info/articles/auth.html](https://rtweet.info/articles/auth.html)

---

# rstats2twitter

The beauty of {rtweet} is that all you need is a Twitter username and password (see the sketch below)

+ A user token is generated using the embedded **rstats2twitter** app
+ The path to the token is saved as an environment variable
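A minimal sketch of that first-run flow, assuming a recent {rtweet} release: the browser-based authorization is triggered by your first API call, so no keys need to be copied by hand.

```r
## the first rtweet call opens a browser window asking you to authorize
## the rstats2twitter app; once approved, the token is created and
## cached so future sessions can reuse it
library(rtweet)
mizzou <- search_tweets("mizzou", n = 10)

## inspect the token rtweet is currently using
get_token()
```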
---
class: inverse, center, middle

# Twitter Data!

---
class: inverse, center, middle

# 1. <br /> Searching for tweets

---

## `search_tweets()`

Search for one or more keyword(s)

```r
## basic keyword search
rds <- search_tweets("rstats data science")
rds
```

<br>

> *Note*: implicit `AND` between words

---

## `search_tweets()`

Search for an exact phrase

```r
## single quotes around doubles
ds <- search_tweets('"data science"')

## or escape the quotes
ds <- search_tweets("\"data science\"")
ds
```

---

## `search_tweets()`

Search for keyword(s) **and** a phrase

```r
## keyword and exact phrase search
rpds <- search_tweets("rstats python \"data science\"")
rpds
```

---

## `search_tweets()`

+ `search_tweets()` returns the 100 most recent matching tweets by default
+ Increase `n` to return more (tip: use intervals of 100)

```r
## increase desired number of tweets to return
rstats <- search_tweets("rstats", n = 10000)
rstats
```

<br>

> Rate limit of 18,000 tweets per fifteen minutes

---

## `search_tweets()`

**PRO TIP #1**: Get the firehose for free by searching for tweets from verified **or** non-verified accounts

```r
## firehose hack
fff <- search_tweets(
  "filter:verified OR -filter:verified", n = 18000)
fff
```

Visualize second-by-second frequency

```r
## time series
ts_plot(fff, "secs")
```

---

<p style="text-align:center"> <img src="img/fff.png" /> </p>

---

## `search_tweets()`

**PRO TIP #2**: Use search operators provided by Twitter, e.g.,

+ filter by language and exclude retweets and replies

```r
## search operators
rt <- search_tweets("rstats", lang = "en",
  include_rts = FALSE, `-filter` = "replies")
```

+ filter only tweets linking to news articles

```r
## journalism filter
nws <- search_tweets("filter:news")
```

---

## `search_tweets()`

+ filter only tweets that contain links

```r
## URL filter
links <- search_tweets("filter:links")
links
```

+ filter only tweets that contain video

```r
## video filter
vids <- search_tweets("filter:video")
vids
```

---

## `search_tweets()`

+ filter only tweets sent from (`from:{screen_name}`) or to (`to:{screen_name}`) certain users

```r
## vector of screen names
users <- c("cnnbrk", "AP", "nytimes", "foxnews",
  "msnbc", "seanhannity", "maddow")
paste0("from:", users, collapse = " OR ")
#> "from:cnnbrk OR from:AP OR from:nytimes OR from:foxnews OR
#>  from:msnbc OR from:seanhannity OR from:maddow"

## search for tweets from any of these users
fromusers <- search_tweets(paste0("from:", users, collapse = " OR "))
fromusers
```

---

## `search_tweets()`

+ filter only tweets with at least 100 favorites or 100 retweets

```r
## filter by faves/retweets
pop <- search_tweets(
  paste("(filter:verified OR -filter:verified)",
    "(min_faves:100 OR min_retweets:100)"))
```

+ filter by the type of device that posted the tweet

```r
## filter by device
rt <- search_tweets("lang:en", source = '"Twitter for iPhone"')
```

---

## `search_tweets()`

**PRO TIP #3**: Search by geolocation (e.g., tweets within 25 miles of Columbia, MO)

```r
## filter by geolocation
como <- search_tweets(
  geocode = "38.9517,-92.3341,25mi", n = 10000
)
como
```
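---

## `rate_limit()`

Several of the tips above bump against the 18,000-per-fifteen-minutes search quota, so it helps to check how much of the allowance remains before a large pull. A minimal sketch (`"search/tweets"` is the standard API endpoint name):

```r
## check remaining search quota for the current token
rate_limit(query = "search/tweets")
```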
---

## `search_tweets()`

Use `lat_lng()` to convert geographical data into `lat` and `lng` variables.

```r
## return/maximize and plot lat/long coord data
como <- lat_lng(como)

## generate US state plot (and zoom in on Missouri)
maps::map("state", fill = TRUE, col = "#ffffff",
  lwd = .25, mar = c(0, 0, 0, 0),
  xlim = c(-96, -89), ylim = c(35, 41))

## plot points
with(como, points(lng, lat, pch = 20, col = "red"))
```

> This code plots geotagged tweets on a map of Missouri

---

<p style="text-align:center"> <img src="img/como.png" /> </p>

---

## `search_tweets()`

**PRO TIP #4**: (for developer accounts only) Use `bearer_token()` to increase the rate limit to 45,000 per fifteen minutes.

```r
## use bearer token for better rate limits
mosen <- search_tweets(
  "beto OR trump OR gillibrand", n = 45000,
  token = bearer_token()
)
```

---
class: inverse, center, middle

# 2. <br /> User timelines

---

## `get_timeline()`

Get the most recent tweets posted by a user.

```r
## get tweets posted to user timeline
cnn <- get_timeline("cnn")

## view time series of tweet frequency
ts_plot(cnn)
```

---

## `get_timeline()`

Get up to the most recent 3,200 tweets (endpoint max) posted by multiple users.

```r
## get multiple timelines
nws <- get_timeline(c("cnn", "foxnews", "msnbc"), n = 3200)
```

---

## `ts_plot()`

Group by `screen_name` and plot hourly frequencies of tweets.

```r
## plot multiple time series
nws %>%
  dplyr::group_by(screen_name) %>%
  ts_plot("hours")
```

---
class: inverse, center, middle

# 3. <br /> User favorites

---

## `get_favorites()`

Get up to the most recent 3,000 tweets favorited by a user.

```r
## get tweets favorited by user
kmw_favs <- get_favorites("kearneymw", n = 3000)
```

---
class: inverse, center, middle

# 4. <br /> Lookup statuses

---

## `lookup_tweets()`

```r
## vector of status IDs
status_ids <- c("947235015343202304", "947592785519173637",
  "948359545767841792", "832945737625387008")

## lookup status (tweet) data
twt <- lookup_tweets(status_ids)
```

---
class: inverse, center, middle

# 5. <br /> Getting friends/followers

---

## Friends/followers

Twitter's API documentation distinguishes between **friends** and **followers**.

+ **Friend** refers to an account a given user follows
+ **Follower** refers to an account following a given user

---

## `get_friends()`

Get the user IDs of accounts **followed by** (AKA the friends of) [@jack](https://twitter.com/jack), the co-founder and CEO of Twitter.

```r
## accounts followed by a user
fds <- get_friends("jack")
fds
```

---

## `get_friends()`

Get the friends of **multiple** users in a single call.

```r
## accounts followed by users
fds <- get_friends(
  c("hadleywickham", "NateSilver538", "Nate_Cohn")
)
fds
```

---

## `get_followers()`

Get the user IDs of accounts **following** (AKA the followers of) [@mizzou](https://twitter.com/mizzou).

```r
## accounts following user
mu <- get_followers("mizzou")
mu
```

---

## `get_followers()`

Unlike friends (limited by Twitter to 5,000 per request), there is **no limit** on the number of followers.

To get the user IDs of all 55(ish) million followers of @realDonaldTrump, you need two things:

1. A stable **internet** connection
1. **Time** (approximately five and a half days)

---

## `get_followers()`

Get all of Donald Trump's followers.

```r
## get all of trump's followers
rdt <- get_followers(
  "realdonaldtrump", n = 56000000,
  retryonratelimit = TRUE
)
```
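The ID vectors returned by `get_friends()` and `get_followers()` pair naturally with `lookup_users()` (covered next). A minimal sketch, assuming the `mu` object from the `get_followers("mizzou")` example above:

```r
## hydrate follower IDs into full user-level data
mu_users <- lookup_users(mu$user_id)
mu_users
```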
---
class: inverse, center, middle

# 6. <br /> Lookup users

---

## `lookup_users()`

Look up user-level data (and each user's most recent tweet) associated with a vector of `user_id` or `screen_name` values.

```r
## vector of users
users <- c("hadleywickham", "NateSilver538", "Nate_Cohn")

## lookup users twitter data
usr <- lookup_users(users)
usr
```

---

## `search_users()`

It's also possible to search for users. Twitter will look for matches in user names, screen names, and profile bios.

```r
## search for breaking news accounts
bkn <- search_users("breaking news")
bkn
```

---
class: inverse, center, middle

# 7. <br /> Lists

---

## `lists_memberships()`

+ Get an account's list memberships (lists that include an account)

```r
## lists that include Nate Silver
nsl <- lists_memberships("NateSilver538")
nsl
```

---

## `lists_members()`

+ Get all list members (accounts on a list)

```r
## all members of congress
cng <- lists_members(owner_user = "cspan", slug = "members-of-congress")

## all members of the cabinet
cab <- lists_members(owner_user = "cspan", slug = "the-cabinet")
```

> This actually led to an interesting piece of data science...

---

<br>

<p style="text-align:center"> <img src="img/oped-article.png" /> </p>

---

<p style="text-align:center"> <img width="70%" src="img/oped-plot.png" /> </p>

---
class: inverse, center, middle

# 8. <br /> Streaming tweets

---

## `stream_tweets()`

**Sampling**: small random sample (`~ 1%`) of all publicly available tweets

```r
## random (1%) stream sample
ss <- stream_tweets("")
```

**Filtering**: search-like query (up to 400 keywords)

```r
## stream by keyword
sf <- stream_tweets("mueller,fbi,investigation,trump,realdonaldtrump")
```

---

## `stream_tweets()`

**Tracking**: vector of user IDs (up to 5,000 user_ids)

```r
## user IDs from congress members (lists_members() example output)
st <- stream_tweets(cng$user_id)
```

**Location**: geographical coordinates (1-360 degree location boxes)

```r
## world-wide bounding box
sl <- stream_tweets(c(-180, -90, 180, 90))
```

---

## `stream_tweets()`

The default stream duration is thirty seconds (`timeout = 30`)

+ Specify a different duration (in seconds)

```r
## stream for 10 minutes
stm <- stream_tweets(timeout = 60 * 10)
```

---

## `stream_tweets()`

Stream JSON data directly to a text file

```r
## stream text to file
stream_tweets(timeout = 60 * 10,
  file_name = "random-stream-2018-11-13.json",
  parse = FALSE)
```

Read in a streamed JSON file

```r
## parse JSON twitter data
rj <- parse_stream("random-stream-2018-11-13.json")
```
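Parsed stream (and search) objects also carry user-level data for the tweet authors. A minimal sketch, assuming the `rj` object from the `parse_stream()` example above:

```r
## extract user-level data for the authors of the parsed tweets
rj_users <- users_data(rj)
rj_users
```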
---

## `stream_tweets()`

Stream tweets indefinitely.

```r
## endless stream
stream_tweets(timeout = Inf,
  file_name = "random-stream-2018-11-13.json",
  parse = FALSE)
```

---

## `lookup_coords()`

A useful convenience function for quickly looking up coordinates (though it now requires a Google Maps API key)

```r
## stream tweets sent from london
luk1 <- stream_tweets(q = lookup_coords("London, UK"), timeout = 60)

## search tweets sent from london
luk2 <- search_tweets(
  geocode = lookup_coords("London, UK"), n = 1000)
```

---
class: inverse, center, middle

# Analyzing Twitter data

---

## Data set

For these examples, let's gather a data set of iPhone and Android users

```r
## search for tweets sent from android and iphone
iphone_android <- search_tweets(
  paste('(filter:verified OR -filter:verified)',
    '(source:"Twitter for iPhone" OR source:"Twitter for Android")'),
  include_rts = FALSE, n = 18000
)

## view breakdown of tweet source (device)
table(iphone_android$source)
```

---

## Text processing

Tokenize tweets into words

```r
## tokenize each tweet into a vector of words
wds <- tokenizers::tokenize_tweets(iphone_android$text)

## collapse back into strings
txt <- purrr::map_chr(wds, paste, collapse = " ")

## get sentiment of the tokenized text using the afinn dictionary
iphone_android$sent <- syuzhet::get_sentiment(
  txt, method = "afinn"
)
```

---

## Compare groups

Group by source and summarize some numeric variables

```r
## group by device and summarise
iphone_android %>%
  dplyr::group_by(source) %>%
  dplyr::summarise(
    sent = mean(sent, na.rm = TRUE),
    avg_rt = mean(retweet_count, na.rm = TRUE),
    avg_fav = mean(favorite_count, na.rm = TRUE),
    tweets = mean(statuses_count, na.rm = TRUE),
    friends = mean(friends_count, na.rm = TRUE),
    followers = mean(followers_count, na.rm = TRUE),
    ff_rat = (friends + 1) / (friends + followers + 1)
  )
```

---

## Features

Easily automate feature extraction for Twitter data.

```r
## install package
remotes::install_github("mkearney/textfeatures")

## feature extraction
tf <- textfeatures::textfeatures(iphone_android)

## add dependent variable
tf$y <- iphone_android$source == "Twitter for iPhone"
```

---

## Machine learning

Run a boosted model

```r
## use {gbm} to estimate model
m1 <- gbm::gbm(y ~ ., data = tf[1:15000, -1], n.trees = 200)
# summary(m1)

## generate predictions
p <- predict(m1, newdata = tf[15001:nrow(tf), -1],
  type = "response", n.trees = 200)

## how'd we do?
table(p > .50, tf$y[15001:nrow(tf)])
```

---

## {xgboost}

```r
## (x is a numeric model matrix built from the features; y is the 0/1 label)
## randomly sample 80% of rows in x (model matrix)
train_rows <- sample(seq_len(nrow(x)), ceiling(nrow(x) * .8))

## select rows not sampled (not in train_rows)
test_rows <- tfse::nin(seq_len(nrow(x)), train_rows)

## set eta (learn rate) and nrounds (number of trees)
m1 <- xgboost::xgboost(
  x[train_rows, ], label = y[train_rows],
  eta = .01, nrounds = 1200)
```

---

## {xgboost} cont'd

```r
## view predictor influence
xgboost::xgb.importance(model = xgboost::xgb.Booster.complete(m1))

## get predictions for test data
pred <- predict(m1, newdata = x[test_rows, ])
```
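As with the {gbm} example, it's worth a quick check of performance on the held-out rows. A minimal sketch, assuming the `pred`, `y`, and `test_rows` objects from the code above and a 0/1 (or logical) label:

```r
## confusion table and accuracy on the held-out rows
table(pred > .50, y[test_rows])
mean((pred > .50) == y[test_rows])
```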
---
class: inverse, center, middle

# Case study

---

# Problem

- Use of bots on social media to manipulate public opinion, amplify social divisions, and spread "fake news"
- **How do we accurately and dynamically classify bots on social media?**

---

# 2016 election

Concerns about **automated accounts, or bots, on social media** manipulating public opinion reached a fever pitch during the 2016 general election.

The most alarming concerns involved **Kremlin-linked bots**:

+ pushing **fake news** stories
+ amplifying **social divisions**
+ misrepresenting **public opinion**

---

# Research question(s)

## How do we **identify bots on social media**?

## How can we do it in **real time** so as to filter out inauthentic social media traffic?

---
class: inverse, center, middle

# Social media data

---

# Social media apps

There's been a lot of talk about Cambridge Analytica's abuse of social media data. **Reactions** can be summarized as follows:

1. **Facebook was negligent** because it inadequately regulated third-party apps
1. **Cambridge Analytica was unethical** because it stole and used data it shouldn't have

> In other words...people don't know exactly what FB and CA actually did, but they know it was wrong.

---

# Social media APIs

**Application Program Interfaces** (APIs) are sets of protocols that govern interactions between sites and users

Similar to web browsers but with a different primary objective:

+ Web browsers **render content**
+ Web APIs **organize data**

For public APIs, many sites only allow access to **authorized** users

+ Twitter, Facebook, Instagram, Github, etc.

---

# Twitter vs. Facebook

For a case study about political bots on social media, Twitter is better because...

1. It's more open about its **data**
   + Unlike Facebook, which actually clamped down on data sharing (with developers) starting in 2014, **Twitter is quite open with its data**
1. It's more **public facing**
   + The **default privacy setting** is public
   + User **connections** are asymmetrical

---
class: inverse, center, middle

# Identifying bots

---

# Current approach

The creators of [**botometer**](https://botometer.iuni.iu.edu/#!/) maintain a list of open-sourced academic studies that identified bots. But this approach is not without **limitations**:

+ Academic research moves slowly (especially with Twitter banning many of the bots)
+ A relatively small number of bots have been collected
+ Labor-intensive methods (human coding)

---

# New approach

One **solution** to these limitations is to combine the academic sources with an easy-to-automate method that takes advantage of naturally occurring human coding

1. Select a handful of previously **verified** bots
1. Look up the public **lists** that include those bots
1. Identify list **names** that self-identify as bot lists
   - Perform validity checks on the other listed accounts

---
class: inverse, center, middle

# Compiling data sets

---

# User/tweet data

Using the list of screen names for non-bot (based on previous academic research and human coding) and bot accounts, I gathered...

+ **tweet-level data** for the most recent 100 tweets posted by each user via `rtweet::get_timelines()`
+ **user-level (or account-level) data** for each user via `rtweet::lookup_users()`

The data returned by Twitter consist of over 80 features related to the tweet and the author (user) of each tweet.

---

# Data sets

Randomly sample the Twitter data into train and test data sets

+ **Train data**: approx. 70% of users in the sample
  - ensuring roughly equal numbers of bots and non-bots
+ **Test data**: the remaining (30%) bot and non-bot accounts

---

# Features

In addition to the numeric variables already returned by Twitter, I used **{textfeatures}** to extract up to 26 features for each textual variable (see the sketch below). The text features were extracted for:

+ Tweets
+ Name
+ Screen name
+ Location
+ Description (bio)
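A minimal, hypothetical sketch of the kind of call involved (not the actual modeling pipeline); `usr` stands in for a users data frame returned by `rtweet::lookup_users()`:

```r
## extract text features from the profile description (bio) field
desc_features <- textfeatures::textfeatures(usr$description)
```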
---
class: inverse, center, middle

# Modeling

---

# Two models

Two models were created.

1. The **default model**, which uses both users and tweets data
   - Rate limited by Twitter to 180 users every 15 minutes
1. The **fast model**, which uses only users data
   - Rate limited by Twitter to 90,000 users every 15 minutes

---

# Model results

A gradient boosted logistic model (via `gbm::gbm()`) was trained and tested on the two data sets for both the default and fast models.

+ The **default model** was 93.53% accurate when classifying bots and 95.32% accurate when classifying non-bots
+ The **fast model** was 91.78% accurate when classifying bots and 92.61% accurate when classifying non-bots

Overall...

+ The **default model** was correct 93.8% of the time
+ The **fast model** was correct 91.9% of the time

---
class: inverse, center, middle

# tweetbotornot

---

# Applications

+ This has since been assembled into an R package, [**{tweetbotornot}**](https://github.com/mkearney/tweetbotornot) (coming soon to CRAN!)
+ It has also been exported as a [Shiny web app](https://mikewk.shinyapps.io/botornot)

---

## Tweetbotornot

A package designed to estimate the probability of an account being a bot.

```r
## install from Github
remotes::install_github("mkearney/tweetbotornot")

## estimate some accounts
bp <- tweetbotornot::tweetbotornot(c(
  "kearneymw", "realdonaldtrump", "netflix_bot",
  "tidyversetweets", "thebotlebowski")
)
bp
```

---
class: inverse, center, middle

# Future directions

---

# Future research

Between the proliferation of news stories about Kremlin-linked bots on social media and the number of false positives reported to me about **tweetbotornot**, I'm starting to think...

The real challenge is not **identifying** social media bots but instead **defining** what it means to be a *bot* on social media

+ Future research should examine how and why people define certain accounts as bots and others as not bots

---
class: inverse, center, middle

# The end