Data
-- Here is a list of data sets that you may use for the class projects.
-- These data sets CANNOT be distributed. They are for use in this
class only.
-- To impress me most, work with the largest data set.
-- If you have, or plan to collect, your own data sets of a similar
"social" nature (actors, edges, and time stamps on the edges) and you
are willing to share them with this and future classes, that would be
much appreciated.
-- You should also discuss it with me before going ahead with your own
data set.
- ACM Author-Paper Data (6M)
Format of each line: paper_ID     year_of_publication     space_separated_list_of_authors
- DBLP Author-Paper Data (61M)
Format of each line: paper_title     year_of_publication     space_separated_list_of_authors
- IEEE Author-Paper Data (2.2M)
Format of each line: paper_ID     year_of_publication     space_separated_list_of_authors
- IMDB Actor-Movies Data (10M)
Format of each line: movie_title     opening_year     space_separated_list_of_main_actors
- BLOG comments Data (201M)
Format of each line: BlogPost_ID     BloggerID     BlogPostTimeStamp     CommentTimeStamp     space_separated_list_of_commenter_IDs_who_commented_on_that_day
- Wikipedia Edit Data (330M)
Format of each line: wiki_ID     date_of_first_day_of_the_week     space_separated_user_ids_who_edited_on_that_week
- Twitter: all tweets from a fixed set of a few thousand users. (1.6G)
Format of each line: the standard JSON format provided by Twitter.
Useful fields are "id" (of tweeter), "friends_count", "created_at" (time stamp), "in_reply_to_user_id" (null means to followers; non-null means to a specific user)
Twitter Python Scripts:
Here are some potentially useful Twitter scripts which you could use to
develop your own scripts for the Twitter API. They are provided PURELY
AS IS. Don't even attempt to ask me for help with them; it's all you!
- collect_tweets.py: input is a text file with a list of user ids; output is JSON files of tweets for each user
- collector.py: collects statuses (tweets) from the Twitter stream (a random sampling of all tweets)
- get_users.py: input is a list of users; output is two files, one listing each user's friends and followers and another listing the union of the user sets
- Twitter: a random sampling of all tweets in April 2011. (9G)
Format of each line: same as first Twitter data set.
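
As a starting point for reading these files, here is a minimal Python sketch that turns the line formats above into time-stamped edges. It rests on several assumptions that you should check against the actual files: the fields of the three-field formats (ACM, DBLP, IEEE, IMDB, Wikipedia) are taken to be tab-separated, since titles may contain spaces; and in the Twitter JSON the tweeter's "id" is taken to sit inside a nested "user" object. The function names parse_record, co_membership_edges and reply_edge are hypothetical helpers, not part of the scripts listed above. The BLOG data has five fields per line and needs its own split.

```python
import json
from itertools import combinations


def parse_record(line, sep="\t"):
    """Split one 'key <sep> time <sep> members' line into (key, time, [members]).

    ASSUMPTION: fields are separated by `sep` (tab by default) and the last
    field is a space-separated list of actor ids.
    """
    key, time, members = line.rstrip("\n").split(sep, 2)
    return key.strip(), time.strip(), members.split()


def co_membership_edges(line, sep="\t"):
    """Yield (actor_a, actor_b, time) for every unordered pair on one line."""
    _, time, members = parse_record(line, sep)
    # sorted(set(...)) removes duplicate ids and gives a stable pair order
    for a, b in combinations(sorted(set(members)), 2):
        yield a, b, time


def reply_edge(tweet_json):
    """Return (sender, recipient, created_at) for a directed tweet, else None.

    ASSUMPTION: the tweeter's id is stored under tweet["user"]["id"].
    """
    tweet = json.loads(tweet_json)
    target = tweet.get("in_reply_to_user_id")
    if target is None:  # null means the tweet went to all followers
        return None
    sender = tweet.get("user", {}).get("id")
    return sender, target, tweet.get("created_at")
```

For example, list(co_membership_edges("p1\t2001\talice bob carol")) yields one edge per pair of co-authors, each stamped with the year 2001. Pairwise edges are only one possible reading of "actors on the same paper"; a bipartite actor-to-item graph may suit some projects better.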