Analyze Tweet Data

Wrangling and analysis of Tweets from WeRateDogs (@dogrates), with visualization of insights, using Python in Jupyter Notebook. The project centers on the data wrangling process of gathering, assessing, and cleaning data. Various methods, including programmatic approaches such as querying the Twitter API with Python's Tweepy package, were used to collect Tweets and relevant metadata.

Software Requirements

  • conda 4.6.3 or a similar version
  • python 3.7.2 (or any Python 3 release)
  • Packages
    • pandas
    • requests
    • os
    • tweepy 3.7.0
      git clone git://github.com/tweepy/tweepy.git
      cd tweepy
      python setup.py install
    • timeit.default_timer
    • json
    • numpy
    • copy
    • datetime
    • matplotlib.pyplot
    • seaborn
    • statsmodels.api
    • TextBlob
      pip install -U textblob
      python -m textblob.download_corpora
  • twitter_api.py file
    • The file returns the Twitter API wrapper used in Section 3 of Gathering Data to query Twitter's API with the tweet IDs obtained from the first dataset in Section 1 of Gathering Data.
    • The file is imported in the notebook but is not tracked in the repository, in order to prevent disclosure of private keys and tokens (a short usage sketch follows the code below).
      import tweepy

      def twitter_api():

          # Keys and tokens (placeholders for the actual credentials)
          consumer_key = 'CONSUMER KEY'
          consumer_secret = 'CONSUMER SECRET'
          access_token = 'ACCESS TOKEN'
          access_secret = 'ACCESS SECRET'

          # OAuthHandler instance equipped with an access token for OAuth authentication
          auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
          auth.set_access_token(access_token, access_secret)

          # Twitter API wrapper that waits when the rate limit is reached
          return tweepy.API(auth_handler=auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
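    • A minimal usage sketch, assuming twitter_api.py sits next to the notebook; the tweet ID below is only an illustrative placeholder:
      from twitter_api import twitter_api

      # create the authenticated API wrapper defined above
      api = twitter_api()

      # fetch the full status of a single tweet (ID is a placeholder)
      status = api.get_status(123456789012345678, tweet_mode='extended')
      print(status.retweet_count, status.favorite_count)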

Part 1: Data Wrangling

  • The content of the first part of the project, including all code blocks for data wrangling, is documented in the Jupyter Notebook file analyze_tweet_1_wrangle.ipynb. The HTML file analyze_tweet_part1_wrangle.html was exported from this notebook.
  • The three raw datasets gathered in the first step of data wrangling and the clean version of the master dataset obtained after the last step are all available in the data directory.
    • Raw Datasets
      • twitter_archive_enhanced.csv
      • image-predictions.tsv
      • tweet_json.txt
    • Merged Dataset: twitter_archive_master.csv

Gathering Data

  1. Enhanced Twitter Archive
    • Udacity was provided with the WeRateDogs Twitter archive, which contains basic tweet data for all 5,000+ tweets.
    • Udacity enhanced this original dataset by extracting dog ratings, dog names, and dog "stage" from the tweets' text data.
    • The enhanced Twitter archive was made available as twitter_archive_enhanced.csv for manual download and is assigned to the object df_archive.
  2. Image Predictions
    • Udacity ran the images included in the tweets from the enhanced Twitter archive through a neural network and generated the top three predictions of each dog's breed.
    • The image-predictions.tsv file hosted on Udacity's servers is downloaded programmatically with the Requests library and assigned to the object df_image (see the sketch after this list).
  3. Additional Tweet Data
    • Additional tweet data that were omitted while enhancing the Twitter archive are gathered by querying Twitter's API with Python's Tweepy library.
    • The JSON data of each tweet is dumped into the tweet_json.txt file.
    • Only the retweet and favorite counts for each tweet are extracted and assigned to the object df_json.
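
A condensed sketch of the two programmatic gathering steps described in items 2 and 3 above. The download URL, file paths, and column names are assumptions for illustration; the exact values are documented in the notebook.

  import json
  import requests
  import tweepy
  import pandas as pd
  from twitter_api import twitter_api

  # Item 2: download image-predictions.tsv with a GET request (placeholder URL)
  url = 'https://example.com/image-predictions.tsv'
  response = requests.get(url)
  with open('data/image-predictions.tsv', 'wb') as file:
      file.write(response.content)
  df_image = pd.read_csv('data/image-predictions.tsv', sep='\t')

  # Item 3: query Twitter's API for each tweet ID and dump each tweet's JSON
  api = twitter_api()
  with open('data/tweet_json.txt', 'w') as file:
      for tweet_id in df_archive['tweet_id']:
          try:
              status = api.get_status(tweet_id, tweet_mode='extended')
              json.dump(status._json, file)
              file.write('\n')
          except tweepy.TweepError:
              pass  # skip tweets that are deleted or otherwise unavailable

  # keep only the retweet and favorite counts for each tweet
  records = []
  with open('data/tweet_json.txt') as file:
      for line in file:
          tweet = json.loads(line)
          records.append({'tweet_id': tweet['id'],
                          'retweet_count': tweet['retweet_count'],
                          'favorite_count': tweet['favorite_count']})
  df_json = pd.DataFrame(records)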

Assessing Data

  • Each of the three datasets gathered above is assessed for quality and tidiness issues.
  • Only the observations that make cleaning necessary in the next section are documented.

Cleaning Data

  • Each assessment from the Assessing Data section is addressed in three sequential steps: define, code, and test.
  • The clean versions of the three datasets are merged to create df_archive_master, which is stored as a separate .csv file, twitter_archive_master.csv (see the sketch below).
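
A minimal sketch of the merge and export step, assuming the cleaned frames are named df_archive_clean, df_image_clean, and df_json_clean and share the tweet_id key; the actual join logic is documented in the notebook.

  import pandas as pd

  # join the three cleaned datasets on tweet_id (frame names are illustrative)
  df_archive_master = (df_archive_clean
                       .merge(df_image_clean, on='tweet_id', how='inner')
                       .merge(df_json_clean, on='tweet_id', how='inner'))

  # store the merged master dataset alongside the raw files
  df_archive_master.to_csv('data/twitter_archive_master.csv', index=False)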

Part 2: Exploratory Data Analysis and Data Visualization

  • The content of the second part of the project, including all code blocks for analysis, is documented in the Jupyter Notebook file analyze_tweet_2_eda.ipynb. The HTML file analyze_tweet_part2_eda.html was exported from this notebook.
  • The following four topics are investigated.
    1. Time of the Day when WeRateDogs (@dogrates) Shows Most Activity
      • Each day of the week is divided into 24 one-hour increments.
      • Tweet activity is quantified by the percentage of tweets in each increment for a given day.
      • Bar plots, box plots and statistics (mean, standard deviation, quartiles) of the 24 percentages for each day are presented.
    2. Correlation between Favorite Counts and Retweet Counts
      • The correlation between the two variables is examined with a scatter plot and their correlation coefficient (see the sketch after this list).
      • Results from linear regression are applied to further investigate the correlation.
        • The R-squared value and its significance are presented.
        • A linear fit is computed and overlaid on the scatter plot.
    3. Comparison of Dog Ratings and Sentiment of Tweets
      • Numeric dog ratings are categorized into low, medium, and high ratings.
      • Polarity scores of the tweets' text are obtained from sentiment analysis with TextBlob (see the sketch after this list).
      • Three histograms of the polarity scores, one for each category of dog ratings, are presented.
    4. Accuracy and Precision of Predicting Dog Breeds from Images
      • The performance of the neural network in recognizing images and predicting each dog's breed is assessed with both statistics and visualizations.
        • Mean proportion of predictions with dog breeds for each level of prediction
        • Histogram for the distribution of confidence levels for each level of prediction and its center (median)
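
For topic 2, a minimal sketch of the correlation and regression step using the statsmodels package listed in the software requirements; the column names retweet_count and favorite_count are assumptions:

  import statsmodels.api as sm
  import matplotlib.pyplot as plt

  # correlation coefficient between the two counts
  print(df_archive_master['retweet_count'].corr(df_archive_master['favorite_count']))

  # ordinary least squares fit: favorite_count ~ retweet_count
  X = sm.add_constant(df_archive_master['retweet_count'])
  model = sm.OLS(df_archive_master['favorite_count'], X).fit()
  print(model.rsquared, model.pvalues)

  # scatter plot with the fitted line overlaid
  plt.scatter(df_archive_master['retweet_count'], df_archive_master['favorite_count'], alpha=0.3)
  plt.plot(df_archive_master['retweet_count'], model.predict(X), color='red')
  plt.xlabel('retweet_count')
  plt.ylabel('favorite_count')
  plt.show()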
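For topic 3, a sketch of the sentiment scoring with TextBlob, assuming the tweet text lives in a text column and the ratings in rating_numerator; the bin edges for the low/medium/high categories are placeholders:

  import pandas as pd
  from textblob import TextBlob

  # polarity score of each tweet's text, in the range [-1, 1]
  df_archive_master['polarity'] = df_archive_master['text'].apply(
      lambda text: TextBlob(text).sentiment.polarity)

  # categorize numeric ratings into low / medium / high (bin edges are illustrative)
  df_archive_master['rating_level'] = pd.cut(
      df_archive_master['rating_numerator'],
      bins=[0, 9, 12, df_archive_master['rating_numerator'].max()],
      labels=['low', 'medium', 'high'])

  # one polarity histogram per rating category
  df_archive_master.hist(column='polarity', by='rating_level', bins=20)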

Author

Jong Min (Jay) Lee [jmlee5629@gmail.com]

Acknowledgement

  • This project was completed as a mandatory requirement for the Data Wrangling unit from the Data Analyst Nanodegree program at Udacity.
  • Step-by-step guidance from the Get Started page on the Twitter Developer site was referenced to create an app, generate keys and tokens, and query the Twitter API.
  • The Tweepy documentation was referenced to find and understand the package's methods, along with their specifications and arguments, applicable to querying the Twitter API and gathering JSON data.
  • Assessment and cleaning of untidy data were motivated by the extensive nature of this process in data analysis (Dasu and Johnson 2003) and the framework for tidying data (Wickham 2014).
    • Dasu T, Johnson T (2003). Exploratory Data Mining and Data Cleaning. John Wiley & Sons.
    • Wickham H (2014). Tidy Data. Journal of Statistical Software, 59(10). doi:10.18637/jss.v059.i10
  • The TextBlob documentation was referenced to create a TextBlob from text data and obtain the polarity score from the sentiment property.
