Analyze Tweet Data

Wrangling and analysis of Tweets from WeRateDogs (@dogrates), with visualization of insights, using Python in Jupyter Notebook. The project centers on the data wrangling process of gathering, assessing, and cleaning data. Various methods, including programmatic approaches such as querying the Twitter API with Python's Tweepy package, were used to collect Tweets and relevant metadata.

Software Requirements

  • conda 4.6.3 or a similar version
  • python 3.7.2 (or any Python 3 release)
  • Packages
    • pandas
    • requests
    • os
    • tweepy 3.7.0
      git clone git://github.com/tweepy/tweepy.git
      cd tweepy
      python setup.py install
    • timeit.default_timer
    • json
    • numpy
    • copy
    • datetime
    • matplotlib.pyplot
    • seaborn
    • statsmodels.api
    • TextBlob
      pip install -U textblob
      python -m textblob.download_corpora
  • twitter_api.py file
    • The file returns the Twitter API wrapper used in Section 3 of Gathering Data to query Twitter's API with the tweet IDs obtained from the first dataset in Section 1 of Gathering Data.
    • The file is imported in the notebook but is not tracked in the repository, in order to prevent disclosure of private keys and tokens (a short usage sketch follows the code below).
      import tweepy

      def twitter_api():

          # Keys and tokens (placeholders for the actual credentials)
          consumer_key = 'CONSUMER KEY'
          consumer_secret = 'CONSUMER SECRET'
          access_token = 'ACCESS TOKEN'
          access_secret = 'ACCESS SECRET'

          # OAuthHandler instance equipped with an access token for OAuth authentication
          auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
          auth.set_access_token(access_token, access_secret)

          # Twitter API wrapper that waits when the rate limit is reached
          return tweepy.API(auth_handler=auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
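    • A minimal usage sketch, assuming twitter_api.py sits next to the notebook; the tweet ID below is only an illustrative placeholder:
      from twitter_api import twitter_api

      # create the authenticated API wrapper defined above
      api = twitter_api()

      # fetch the full status of a single tweet (ID is a placeholder)
      status = api.get_status(123456789012345678, tweet_mode='extended')
      print(status.retweet_count, status.favorite_count)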

Part 1: Data Wrangling

  • The content of the first part of the project, including all code blocks for data wrangling, is documented in the Jupyter Notebook file analyze_tweet_1_wrangle.ipynb. The HTML file analyze_tweet_part1_wrangle.html was exported from this notebook.
  • The three raw datasets gathered in the first step of data wrangling and the clean version of the master dataset obtained after the last step are all available in the data directory.
    • Raw Datasets
      • twitter_archive_enhanced.csv
      • image-predictions.tsv
      • tweet_json.txt
    • Merged Dataset: twitter_archive_master.csv

Gathering Data

  1. Enhanced Twitter Archive
    • Udacity was provided with the WeRateDogs Twitter archive, which contains basic tweet data for all 5,000+ tweets.
    • Udacity enhanced this original dataset by extracting dog ratings, dog names, and dog "stage" from the tweets' text data.
    • The enhanced Twitter archive was made available as twitter_archive_enhanced.csv for manual download and is assigned to the object df_archive.
  2. Image Predictions
    • Udacity ran the images included in the tweets from the enhanced Twitter archive through a neural network and generated the top three predictions of each dog's breed.
    • The image-predictions.tsv file hosted on Udacity's servers is downloaded programmatically with the Requests library and assigned to the object df_image (see the sketch after this list).
  3. Additional Tweet Data
    • Additional tweet data that were omitted while enhancing the Twitter archive are gathered by querying Twitter's API with Python's Tweepy library.
    • The JSON data of each tweet is dumped into the tweet_json.txt file.
    • Only the retweet and favorite counts for each tweet are extracted and assigned to the object df_json.
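
A condensed sketch of the two programmatic gathering steps described in items 2 and 3 above. The download URL, file paths, and column names are assumptions for illustration; the exact values are documented in the notebook.

  import json
  import requests
  import tweepy
  import pandas as pd
  from twitter_api import twitter_api

  # Item 2: download image-predictions.tsv with a GET request (placeholder URL)
  url = 'https://example.com/image-predictions.tsv'
  response = requests.get(url)
  with open('data/image-predictions.tsv', 'wb') as file:
      file.write(response.content)
  df_image = pd.read_csv('data/image-predictions.tsv', sep='\t')

  # Item 3: query Twitter's API for each tweet ID and dump each tweet's JSON
  api = twitter_api()
  with open('data/tweet_json.txt', 'w') as file:
      for tweet_id in df_archive['tweet_id']:
          try:
              status = api.get_status(tweet_id, tweet_mode='extended')
              json.dump(status._json, file)
              file.write('\n')
          except tweepy.TweepError:
              pass  # skip tweets that are deleted or otherwise unavailable

  # keep only the retweet and favorite counts for each tweet
  records = []
  with open('data/tweet_json.txt') as file:
      for line in file:
          tweet = json.loads(line)
          records.append({'tweet_id': tweet['id'],
                          'retweet_count': tweet['retweet_count'],
                          'favorite_count': tweet['favorite_count']})
  df_json = pd.DataFrame(records)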

Assessing Data

  • Each of the three datasets gathered above is assessed for quality and tidiness issues.
  • Only the observations that make cleaning necessary in the next section are documented.

Cleaning Data

  • Each assessment from the Assessing Data section is addressed in three sequential steps: define, code, and test.
  • The clean versions of the three datasets are merged to create df_archive_master, which is stored as a separate .csv file, twitter_archive_master.csv (see the sketch below).
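
A minimal sketch of the merge and export step, assuming the cleaned frames are named df_archive_clean, df_image_clean, and df_json_clean and share the tweet_id key; the actual join logic is documented in the notebook.

  import pandas as pd

  # join the three cleaned datasets on tweet_id (frame names are illustrative)
  df_archive_master = (df_archive_clean
                       .merge(df_image_clean, on='tweet_id', how='inner')
                       .merge(df_json_clean, on='tweet_id', how='inner'))

  # store the merged master dataset alongside the raw files
  df_archive_master.to_csv('data/twitter_archive_master.csv', index=False)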

Part 2: Exploratory Data Analysis and Data Visualization

  • The content of the second part of the project, including all code blocks for analysis, is documented in the Jupyter Notebook file analyze_tweet_2_eda.ipynb. The HTML file analyze_tweet_part2_eda.html was exported from this notebook.
  • The following four topics are investigated.
    1. Time of the Day when WeRateDogs (@dogrates) Shows Most Activity
      • Each day of the week is divided into 24 one-hour increments.
      • Tweet activity is quantified by the percentage of tweets in each increment for a given day.
      • Bar plots, box plots and statistics (mean, standard deviation, quartiles) of the 24 percentages for each day are presented.
    2. Correlation between Favorite Counts and Retweet Counts
      • The correlation between the two variables is examined with a scatter plot and their correlation coefficient (see the sketch after this list).
      • Results from linear regression are applied to further investigate the correlation.
        • The R-squared value and its significance are presented.
        • A linear fit is computed and overlaid on the scatter plot.
    3. Comparison of Dog Ratings and Sentiment of Tweets
      • Numeric dog ratings are categorized into low, medium, and high ratings.
      • Polarity scores of the tweets' text are obtained from sentiment analysis with TextBlob (see the sketch after this list).
      • Three histograms of the polarity scores, one for each category of dog ratings, are presented.
    4. Accuracy and Precision of Predicting Dog Breeds from Images
      • The performance of the neural network in recognizing images and predicting each dog's breed is assessed with both statistics and visualizations.
        • Mean proportion of predictions with dog breeds for each level of prediction
        • Histogram for the distribution of confidence levels for each level of prediction and its center (median)
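
For topic 2, a minimal sketch of the correlation and regression step using the statsmodels package listed in the software requirements; the column names retweet_count and favorite_count are assumptions:

  import statsmodels.api as sm
  import matplotlib.pyplot as plt

  # correlation coefficient between the two counts
  print(df_archive_master['retweet_count'].corr(df_archive_master['favorite_count']))

  # ordinary least squares fit: favorite_count ~ retweet_count
  X = sm.add_constant(df_archive_master['retweet_count'])
  model = sm.OLS(df_archive_master['favorite_count'], X).fit()
  print(model.rsquared, model.pvalues)

  # scatter plot with the fitted line overlaid
  plt.scatter(df_archive_master['retweet_count'], df_archive_master['favorite_count'], alpha=0.3)
  plt.plot(df_archive_master['retweet_count'], model.predict(X), color='red')
  plt.xlabel('retweet_count')
  plt.ylabel('favorite_count')
  plt.show()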
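For topic 3, a sketch of the sentiment scoring with TextBlob, assuming the tweet text lives in a text column and the ratings in rating_numerator; the bin edges for the low/medium/high categories are placeholders:

  import pandas as pd
  from textblob import TextBlob

  # polarity score of each tweet's text, in the range [-1, 1]
  df_archive_master['polarity'] = df_archive_master['text'].apply(
      lambda text: TextBlob(text).sentiment.polarity)

  # categorize numeric ratings into low / medium / high (bin edges are illustrative)
  df_archive_master['rating_level'] = pd.cut(
      df_archive_master['rating_numerator'],
      bins=[0, 9, 12, df_archive_master['rating_numerator'].max()],
      labels=['low', 'medium', 'high'])

  # one polarity histogram per rating category
  df_archive_master.hist(column='polarity', by='rating_level', bins=20)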

Author

Jong Min (Jay) Lee [jmlee5629@gmail.com]

Acknowledgement

  • This project was completed as a mandatory requirement for the Data Wrangling unit from the Data Analyst Nanodegree program at Udacity.
  • Step-by-step guidance from the Get Started page on the Twitter Developer site was referenced to create an app, generate keys and tokens, and query the Twitter API.
  • The Tweepy documentation was referenced to find and understand the package's methods, along with their specifications and arguments, applicable to querying the Twitter API and gathering JSON data.
  • Assessment and cleaning of untidy data were motivated by the extensive nature of this process in data analysis (Dasu and Johnson 2003) and the framework for tidying data (Wickham 2014).
    • Dasu T, Johnson T (2003). Exploratory Data Mining and Data Cleaning. John Wiley & Sons.
    • Wickham H (2014). Tidy Data. Journal of Statistical Software, 59(10). doi:10.18637/jss.v059.i10
  • The TextBlob documentation was referenced to create a TextBlob from text data and obtain the polarity score from the sentiment property.
