This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
The Knowledge Exchange task and finish group defined a range of Twitter hashtags to be followed throughout the project "Publishing reproducible research output". These hashtags were harvested via the rtweet library and the Twitter API and then saved in CSV format. Up to 7,500 tweets were allowed for each hashtag, a limit that was never reached over the course of the monitoring period. The hashtags considered were:
- #Reproducibility
- #Replicability
- #ReproducibleScience
- #ResearchReproducibility
- #ReproducibleResearch
- #ResearchCredibility
- #GoodResearchPractices
- #RegisteredReports
- #GoodScience
- #ResearchCompendium (from 23/11 onwards)
- #ResearchCompendia (from 23/11 onwards)
- #ReproducibilityCrisis (from 23/11 onwards)
- #ReplicabilityCrisis (from 23/11 onwards)
- #ReplicationCrisis (from 23/11 onwards)
- #TuringWay (from 23/11 onwards)
Data was harvested on Mondays, starting from 09/11/2020. Please note that Twitter’s Terms of Service do not allow the sharing of full data, so only Tweet ids are available as part of this deposit.
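As a minimal sketch of how a deposit limited to Tweet ids could be prepared, the base R example below reduces a small, made-up data frame to its unique ids and writes them to a CSV file. The data frame, its values and the temporary file are assumptions for illustration only; the column name status_id follows the rtweet output used later in this notebook.

```r
# Hypothetical stand-in for the harvested data; only status_id matters here.
harvest <- data.frame(
  status_id = c("1325748001", "1325748002", "1325748001"),
  text = c("first tweet", "second tweet", "first tweet"),
  stringsAsFactors = FALSE
)

# Keep one row per Tweet id and drop every other column before sharing.
tweet_ids <- data.frame(status_id = unique(harvest$status_id),
                        stringsAsFactors = FALSE)

# Write the ids to a temporary file; a real deposit would use a named file.
ids_file <- tempfile(fileext = ".csv")
write.csv(tweet_ids, ids_file, row.names = FALSE)

# Read the ids back as text so that long ids are not converted to numbers.
ids_back <- read.csv(ids_file, colClasses = "character")
```

Reading the ids as character avoids the loss of precision that can occur when very long numeric ids are parsed as numbers.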
Software environment
- R version 4.1.0 (2021-05-18)
- Platform: x86_64-w64-mingw32/x64 (64-bit)
- Running under: Windows 10 x64 (build 19042)
Library versions
- data.table 1.14.0
- dplyr 1.0.6
- ggplot2 3.3.4
- networkD3 0.4
- purrr 0.3.4
- RColorBrewer 1.1.2
- readr 1.4.0
- rmarkdown 2.11
- rtweet 0.7.0
- SnowballC 0.7.0
- stringr 1.4.0
- tidyverse 1.3.1
- tm 0.7.8
- wordcloud 2.6
Hardware
- Device: IdeaCentre A540-24ICB
- Processor: Intel(R) Core(TM) i5-9400T CPU @ 1.80GHz 1.80 GHz
- Installed RAM: 8.00 GB
- System type: 64-bit operating system, x64-based processor
Section 1 - Libraries
library(data.table)
library(dplyr)
options(dplyr.summarise.inform = FALSE)
library(ggplot2)
library(networkD3)
library(purrr)
library(RColorBrewer)
library(readr)
library(rmarkdown)
library(rtweet)
library(SnowballC)
library(stringr)
library(tidyverse)
library(tm)
library(wordcloud)
Section 2 - Data import
The CSV data from the separately harvested files is imported into a single dataset and deduplicated by Tweet id.
# Clear the environment (RStudio)
rm(list = ls())
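# This reads any number of Twitter .csv datasets harvested via rtweet from the working directory.
# list.files() expects a regular expression rather than a wildcard, so "\\.csv$" matches file names ending in ".csv".
tbl <-
list.files(pattern = "\\.csv$") %>%
map_df(~read_csv(., col_types = cols(.default = "c")))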
# The tweets are de-duplicated by status id: the datasets downloaded may overlap, for example in cases where people used more than one of the hashtags monitored.
Twitter_data <- tbl[!duplicated(tbl$status_id),]
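The import and de-duplication steps can be tried on toy data. The sketch below uses base R only (read.csv and rbind stand in for read_csv and map_df), and the two overlapping files, their contents and the temporary folder are made up for illustration.

```r
# Create a temporary folder with two small, made-up harvest files that overlap.
harvest_dir <- tempfile("harvest_")
dir.create(harvest_dir)
week1 <- data.frame(status_id = c("101", "102"), text = c("a", "b"))
week2 <- data.frame(status_id = c("102", "103"), text = c("b", "c"))
write.csv(week1, file.path(harvest_dir, "week1.csv"), row.names = FALSE)
write.csv(week2, file.path(harvest_dir, "week2.csv"), row.names = FALSE)

# list.files() takes a regular expression: "\\.csv$" matches names ending in ".csv".
csv_files <- list.files(harvest_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every column as text, then stack the files into one data frame.
all_rows <- do.call(rbind, lapply(csv_files, read.csv, colClasses = "character"))

# Keep only the first occurrence of each status id.
unique_rows <- all_rows[!duplicated(all_rows$status_id), ]
```

Here the tweet with id "102" appears in both files and is kept once, so four imported rows become three.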
Section 3 - Relevance checks
Accounts that do not include any of the words below in their description are likely to be irrelevant or invalid: they might have used the monitored hashtags in a different context that is not appropriate to this analysis. This approach is an approximation and will not be 100% accurate (e.g. an account with a blank profile description is always discarded). However, without filtering we would be very likely to include irrelevant accounts and tweets. The list of words below was developed by reviewing a random set of Twitter accounts in the dataset and enriched via personal knowledge of the sector.
account_description_validation <- c('academia', 'academic', 'academics', 'analysis group', 'article', 'articles', 'assistant prof', 'assistant professor', 'associate prof', 'associate professor', 'associate director', 'biology', 'biomedical', 'book', 'ciencias', 'clinical trial', 'clinical trials', 'college', 'consortium', 'copyright', 'develop', 'digital object', 'director of', 'discover', 'doctoral', 'doi', 'editor', 'Editor-in-Chief', 'evidence', 'evidence base', 'evidencebased', 'head of', 'higher education', 'highered', 'humanities', 'information science', 'information sciences', 'institute', 'institutes', 'institution', 'institutions', 'interdisciplinary', 'journal', 'journals', 'learning', 'lecturer', 'librarian', 'librarians', 'libraries', 'library', 'licence', 'license', 'licensing', 'LIS', 'manuscript', 'manuscripts', 'medicine', 'metrics', 'modelling', 'museum', 'open access', 'open data', 'open knowledge', 'open research', 'open scholarship', 'paper', 'papers', 'peer review', 'peer reviewed', 'peer-review', 'peer-reviewed', 'ph.d. candidate', 'PhD', 'PhD candidate', 'postdoc', 'post-doc', 'preprint', 'pre-print', 'preprints', 'pre-prints', 'press', 'principal investigator', 'prof', 'prof.', 'professor', 'public domain', 'publication', 'publish', 'publisher', 'publishing', 'recherche', 'recherches', 'relationship between', 'research', 'research data', 'researcher', 'scholar', 'scholarly', 'scholarly communication', 'school', 'scicomm', 'science', 'sciences', 'scientific', 'scientist', 'scientists', 'society of', 'student', 'teacher', 'teaching', 'universities', 'university')
account_description_validation_string <- paste(account_description_validation, collapse="|")
# Accounts are marked as "Keep" or "Discard" based on the above list of words.
Twitter_data$relevance_check <- ifelse(grepl(account_description_validation_string, Twitter_data$description, ignore.case = TRUE), "Keep", "Discard")
# Only the accounts marked as "Keep" are taken forward.
Twitter_data <- Twitter_data[Twitter_data$relevance_check == 'Keep',]
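To illustrate how the relevance check behaves, the sketch below applies the same grepl() logic to a handful of made-up profile descriptions and a shortened word list. It also shows an optional stricter variant with word boundaries, which is an assumption added for comparison and not the approach used in this notebook.

```r
# A shortened, illustrative subset of the validation words.
validation_words <- c("professor", "research", "university", "press")
validation_pattern <- paste(validation_words, collapse = "|")

# Made-up profile descriptions, including an empty and a missing one.
descriptions <- c(
  "Associate Professor of Biology",
  "Coffee, cats and cycling",
  "",
  NA,
  "Espresso enthusiast"
)

# grepl() returns FALSE for "" and NA, so those accounts are discarded.
relevance <- ifelse(grepl(validation_pattern, descriptions, ignore.case = TRUE),
                    "Keep", "Discard")

# Optional stricter variant (hypothetical): word boundaries stop short terms
# such as "press" from matching inside longer words such as "espresso".
strict_pattern <- paste0("\\b(", validation_pattern, ")\\b")
relevance_strict <- ifelse(
  grepl(strict_pattern, descriptions, ignore.case = TRUE, perl = TRUE),
  "Keep", "Discard"
)
```

The plain pattern matches substrings, so it errs on the side of keeping accounts; the stricter variant would discard more of them.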
Section 4 - Data cleaning
Odd characters, retweet markers, mentions, hashtags and URLs are removed from the text of the tweets using regular expressions.
# Remove Unicode format and other textual oddities.
Twitter_data$text <- str_replace_all(Twitter_data$text,"\\<U[^\\>]*\\>"," ")
Twitter_data$text <- str_replace_all(Twitter_data$text,"\r\n"," ")
Twitter_data$text <- str_replace_all(Twitter_data$text,"&"," ")
# Create a new column to save the original text before any further cleaning - this is just a backup.
Twitter_data$Original_Tweet_Backup <- Twitter_data$text
# Continue cleaning, removing hashtags, mentions and "RT". Note that hashtags are saved in a dedicated column so this is just removing them from the body of the tweet.
Twitter_data$text <- str_replace_all(Twitter_data$text,"^RT:? "," ")
# Handles and hashtags may contain underscores, so "_" is included in the character class.
Twitter_data$text <- str_replace_all(Twitter_data$text,"@[[:alnum:]_]+"," ")
Twitter_data$text <- str_replace_all(Twitter_data$text,"#[[:alnum:]_]+"," ")
Twitter_data$text <- str_replace_all(Twitter_data$text,"http\\S+\\s*"," ")
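The effect of the cleaning steps can be checked on a single made-up tweet. In this sketch, base R's gsub() stands in for str_replace_all() so that it has no package dependencies, the helper name clean_tweet is hypothetical, and the mention and hashtag patterns allow underscores, which Twitter permits in handles.

```r
# A made-up tweet used only for illustration.
sample_tweet <- "RT @open_sci: New #ReproducibleResearch guide https://example.org/guide is out"

# Hypothetical helper that applies the cleaning patterns in sequence.
clean_tweet <- function(text) {
  text <- gsub("^RT:? ", " ", text, perl = TRUE)          # leading retweet marker
  text <- gsub("@[[:alnum:]_]+", " ", text, perl = TRUE)  # mentions
  text <- gsub("#[[:alnum:]_]+", " ", text, perl = TRUE)  # hashtags
  text <- gsub("http\\S+\\s*", " ", text, perl = TRUE)    # URLs
  gsub("\\s+", " ", trimws(text), perl = TRUE)            # collapse leftover spaces
}

cleaned <- clean_tweet(sample_tweet)
```

Punctuation such as the colon after the removed mention is left in place at this stage; it is dealt with later, when the corpus is built.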
Section 5 - Overview of posting times
A chart of tweet frequency over time, aggregated by hour, is created.
ts_plot(Twitter_data, "hours") +
labs(x = NULL, y = NULL,
title = "Frequency of tweets vs time",
subtitle = paste0(format(min(Twitter_data$created_at)), " to ", format(max(Twitter_data$created_at))),
caption = "Data collected from Twitter's API (rtweet)") +
theme_minimal()
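The hourly aggregation behind such a chart can be sketched in base R. The timestamps below are made up, and the text format "YYYY-MM-DD HH:MM:SS" and the UTC time zone are assumptions about how created_at is stored after the all-character import.

```r
# Made-up timestamps stored as text.
created_at <- c("2020-11-09 09:15:00", "2020-11-09 09:45:10",
                "2020-11-09 11:05:30", "2020-11-10 09:00:00")

# Convert to date-times; the format string and time zone are assumptions.
created_time <- as.POSIXct(created_at, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Label each tweet with the hour it falls in, then count tweets per hour.
hour_label <- format(created_time, "%Y-%m-%d %H:00")
tweets_per_hour <- table(hour_label)
```

Hours without any tweets do not appear in this table, whereas a time series plot would typically show them as zero.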

Section 7 - Analysis of mentions
Mentions are analysed to identify the most mentioned accounts. Results are shown in a word cloud and a table.
# Exclude retweets, because they are considered as mentions in the data. If you retweet someone, the API considers that as you mentioning them.
Twitter_data_no_retweets <- Twitter_data[Twitter_data$is_retweet == 'FALSE',]
mentions_vector <- Twitter_data_no_retweets$mentions_screen_name
# Split the mentions column using the space separator. "100" allows room for 100 columns, just in case (a single tweet cannot actually contain that many mentions within Twitter's character limit).
split_mentions <- str_split_fixed(mentions_vector, " ", 100)
split_mentions_single_column <- stack(data.frame(split_mentions))
split_mentions_single_column <- data.frame(split_mentions_single_column$values)
# Get rid of rows that are blank
split_mentions_single_column_clean <- split_mentions_single_column[split_mentions_single_column != "", ]
wordcloud(split_mentions_single_column_clean, min.freq=5, random.order=FALSE, colors=brewer.pal(9, 'Reds')[4:9])

# If you want the word cloud in a table:
split_mentions_single_column_clean <- as.data.frame(split_mentions_single_column_clean)
names(split_mentions_single_column_clean)[1] <- 'account'
split_mentions_single_column_clean <- split_mentions_single_column_clean %>%
group_by(account) %>%
summarise(weight = n()) %>%
ungroup()
split_mentions_single_column_clean <- split_mentions_single_column_clean[order(-split_mentions_single_column_clean$weight), ]
paged_table(head(split_mentions_single_column_clean, 50))
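For comparison, the same counting of mentions can be sketched in base R on toy data. Here strsplit() is swapped in for the fixed-width split, so no upper limit on the number of mentions is needed; the screen names are made up.

```r
# Made-up mentions column: space-separated screen names, NA when nobody is mentioned.
mentions <- c("alice bob", "bob", NA, "carol alice bob")

# Split each non-missing entry on spaces and flatten into one vector of names.
all_mentions <- unlist(strsplit(mentions[!is.na(mentions)], " ", fixed = TRUE))

# Count how often each account is mentioned, most mentioned first.
mention_counts <- sort(table(all_mentions), decreasing = TRUE)
mention_table <- data.frame(account = names(mention_counts),
                            weight = as.vector(mention_counts),
                            stringsAsFactors = FALSE)
```

Dropping missing values before splitting keeps them out of the resulting table.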
Section 8 - Analysis of links shared
Links shared in the tweets harvested are shown as a table, to identify literature sources and any relevant events for inclusion in the study.
top_urls <- Twitter_data[, c("urls_expanded_url")]
top_urls <- top_urls[complete.cases(top_urls), ] # This gets rid of rows with missing values
top_urls <- top_urls %>%
group_by(urls_expanded_url) %>%
summarise(count = n()) %>%
ungroup()
top_urls <- top_urls[order(-top_urls$count),]
paged_table(head(top_urls, 50))
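When many links point to different pages of the same source, grouping by domain can complement the table of full URLs. The sketch below is an optional extension rather than part of the analysis above; the URLs are made up and the regular expression is a simple approximation that assumes links start with http:// or https://.

```r
# Made-up expanded URLs, including a missing value.
urls <- c("https://www.nature.com/articles/example-1",
          "https://doi.org/10.1234/example",
          "https://www.nature.com/articles/example-2",
          NA)

# Drop missing values, as in the table of full URLs.
urls <- urls[!is.na(urls)]

# Keep only the host name, without the scheme and an optional "www." prefix.
domains <- sub("^https?://(www\\.)?([^/]+).*$", "\\2", urls)

# Count links per domain, most frequent first.
domain_counts <- sort(table(domains), decreasing = TRUE)
```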
Section 11 - Identification of accounts with the most followers in the sample
The most popular individuals or organisations in the dataset are reviewed to understand the dynamics of social media discourse: who are the tweeters with the largest possible reach? The results are shown in a table.
# Select a list of unique accounts.
highestFollowers <- Twitter_data %>% select(screen_name, followers_count, friends_count, description, location)
highestFollowers_unique <- highestFollowers[!duplicated(highestFollowers$screen_name),]
# The data table is all characters, so relevant columns have to be converted into numbers.
highestFollowers_unique$followers_count <- as.numeric(as.character(highestFollowers_unique$followers_count))
highestFollowers_unique$friends_count <- as.numeric(as.character(highestFollowers_unique$friends_count))
# Sort the table of stakeholders by number of followers and number of friends.
stakeholder_highestFollowers <- highestFollowers_unique[order(-highestFollowers_unique$followers_count, -highestFollowers_unique$friends_count),]
stakeholder_highestFollowers_table <- select(stakeholder_highestFollowers, screen_name, followers_count, friends_count)
paged_table(head(stakeholder_highestFollowers_table,50))
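The conversion to numbers matters because text is sorted character by character. A small sketch with made-up follower counts shows the difference:

```r
# Made-up follower counts stored as text, as they are after the import.
followers_text <- c("9", "10000", "250")

# Sorted as text, "9" ranks above "10000" because "9" comes after "1".
sorted_as_text <- sort(followers_text, decreasing = TRUE)

# Sorted as numbers, the largest account comes first.
sorted_as_number <- sort(as.numeric(followers_text), decreasing = TRUE)
```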
Section 12 - Analysis of the most commonly used words in the dataset
Corpus analysis is used to identify the most common words in the dataset. The results are shown in a word cloud to gain an understanding of the language used when discussing reproducibility.
# The tweet's text is in the column called "text" in the data table called "Twitter_data".
data_for_corpus <- Twitter_data %>% select(screen_name, text)
# Build the corpus for analysis.
corpus <- Corpus(VectorSource(data_for_corpus$text))
# The corpus is cleaned and standardised.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
# Remove stop words by using a pre-defined list for the English language and additional words in quotes.
mystopwords <- c(stopwords("english"),"rt","get","like","just","yes","know","will","good","day","people", "got", "can", "amp")
corpus <- tm_map(corpus,removeWords,mystopwords)
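# Create a document term matrix, then drop sparse terms: with a threshold of 0.97, only words that appear in more than 3% of the tweets are kept.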
myDtm <- DocumentTermMatrix(corpus)
sparse <- removeSparseTerms(myDtm, 0.97)
sparse <- as.data.frame(as.matrix(sparse))
# Calculate the frequency of each word from the data table created - colSums adds up the totals by column.
# freqWords is a row of numbers, which has the words as the column headers.
freqWords <- colSums(sparse)
freqWords <- freqWords[order(-freqWords)]
wordcloud(freq = as.vector(freqWords), words = names(freqWords),random.order = FALSE,
random.color = FALSE, colors = brewer.pal(9, 'Reds')[4:9])

paged_table(head(as.data.frame(freqWords), 50))
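The idea behind the word frequency count can be sketched without the tm package. The example below uses base R on three made-up texts and a short, hypothetical stop word list; it skips the document term matrix and the sparsity filter, so it illustrates the counting only.

```r
# Made-up texts standing in for cleaned tweets.
texts <- c("Reproducible research needs open data",
           "Open data and open code help research",
           "Sharing code is good practice")

# A short, hypothetical stop word list.
stop_words <- c("and", "is", "needs", "help", "good")

# Lower-case the text, remove punctuation and split it into words.
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", texts)), "\\s+"))

# Drop stop words and empty strings.
words <- words[!(words %in% stop_words) & words != ""]

# Count each word, most frequent first.
word_freq <- sort(table(words), decreasing = TRUE)
```

The resulting named counts play the same role as freqWords in the code above and could be passed to wordcloud() in the same way.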
