The recent code challenge used the Gensim library to calculate the similarity between Twitter users based on their tweets, a first foray into natural language processing. The initial dataset of 200 tweets from 15 users, mostly Python enthusiasts, proved too small, so the collection was expanded to 3,200 tweets per user for better results. The pipeline tokenized each tweet, stripped out stopwords and links, and applied Latent Dirichlet Allocation (LDA) to rank user similarities; because LDA training starts from a random initialization, the rankings varied significantly between runs. Despite the complexity and some initial stumbles, the exercise drove home how much input data quality matters in data science, and readers are encouraged to give feedback and share their own experiences and improvements.
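
For readers who want to try the approach themselves, here is a minimal sketch of the pipeline as described: tokenize, strip links and stopwords, train an LDA model, and rank users by topic-vector similarity. The handles, sample tweets, and hyperparameters (`num_topics`, `passes`) are illustrative assumptions, not values from the original challenge.

```python
import re

from gensim import corpora, models, similarities
from gensim.parsing.preprocessing import STOPWORDS


def tokenize(tweet):
    """Lowercase a tweet, strip links, and drop stopwords."""
    tweet = re.sub(r"https?://\S+", "", tweet)      # remove links
    tokens = re.findall(r"[a-z']+", tweet.lower())  # crude word tokenizer
    return [t for t in tokens if t not in STOPWORDS]


# Hypothetical input: one "document" per user, built from all their tweets.
tweets_by_user = {
    "user_a": ["Shipping a new Python release today https://example.com"],
    "user_b": ["Debugging asyncio again, send coffee"],
    "user_c": ["Python typing tips: prefer protocols over ABCs"],
}
handles = list(tweets_by_user)
docs = [tokenize(" ".join(tweets)) for tweets in tweets_by_user.values()]

# Standard Gensim plumbing: build a vocabulary, then a bag-of-words corpus.
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Train LDA; without a fixed random_state the topics (and therefore the
# similarity rankings) differ between runs, as noted above.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=5,
                      passes=10, random_state=42)

# Rank every user against the others by similarity of their topic vectors.
index = similarities.MatrixSimilarity(lda[corpus], num_features=lda.num_topics)
for handle, sims in zip(handles, index):
    ranked = sorted(zip(handles, sims), key=lambda pair: -pair[1])
    print(handle, "->", [(h, round(float(s), 3)) for h, s in ranked if h != handle])
```

Pinning `random_state` in `LdaModel`, as in the sketch, is one way to make runs reproducible and tame the run-to-run variation mentioned above; with it unset, each run can produce a noticeably different ranking.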