(Text Box comment IEEE Homeland Security Technologies Conference, Boston, MA, May 12-13, 2008 ) Discovery, Analysis and Monitoring of Hidden Social Networks and Their Evolution Mark Goldberg, Mykola Hayvanovych, Apirak Hoonlor, Stephen Kelley, Malik Magdon-Ismail, Konstantin Mertsalov, Boleslaw Szymanski and William Wallace Rensselaer Polytechnic Institute, Troy, NY, 12180 {goldberg,hayvam,hoonla,kelles,magdon,mertsk2,szymansk}@cs.rpi.edu, wallaw@rpi.edu Abstract—Social networks that arise spontaneously and evolve over time have become an important component of ever growing global societies used for spreading ideas and indoctrinating people. Their loose membership and dynamics make them difficult to observe and monitor. We present a set of tools for discovery, analysis and monitoring evolution of hidden social groups on the internet and in cyberspace. Two complementary kinds of tools form a core of our approach. One is based on statistical analysis of communication network without considering communication content.The other focuses on communication content and analyzes recursive patterns arising in it. First, we present a software system SIGHTS (Statistical Identification of Groups Hidden in Time and Space), designed for the discovery, analysis, and knowledge visualization of social coalition in communication networks by analyzing communication patterns. We discuss how our algorithms extract groups and track their evolution in Enron-email dataset and in Blog data. The goal of SIGHTS is to assist an analyst in identifying relevant information. A complementary set of tools uses Recursive Data Mining (RDM)to identify frequent patterns in communication content such as email, blog or chat-room sessions. Our approach enablesdiscovery ofpatterns at varyingdegrees of abstraction, in a hierarchical fashion, and in language independent way. We use RDM to distinguish among different roles played by communicators in social networks (e.g., distinguishing between leaders and members). Experiments on the Enron dataset, which categorize members into organizational roles demonstrate that use of the RDM dominant patterns improves role detection. 1. Introduction and Motivation Modern means of communication, such as e-mail, web-logs, and chatrooms, allow individuals to communicate in a number of new ways, and new forms of communication are continually appearing. This vast and ever growing digital communication data is invaluable for analysis of social coalition and their evolution, yet it must be processed before the relevant information can be extracted. While some coalitions publicize their presence in the network, the intention of others is to hide their communication, and their existence, within the large body of all communication in the network. Yet for some groups, their members are not even aware of the existence of groups to which they belong. Social network analysis (see [1], [2], [3], [4], [5], [6]) can be used to understand the social dynamics caused by major events such as unrest in Muslim communities around the world created by the publication of cartoons in a Dutch newspaper in September of 2005. Information about this publication spread through social networks reaching countries that are geographically distant from the origin of publication. We present here a system called SIGHTS design for social network analysis of hidden groups. The main objective of SIGHTS is to provide an umbrella for various algorithms for analyzing communication data with the goal of detecting social coalitions, primarily, “hidden” groups. The graphical user interface of SIGHTS contains facilities to help the analyst examining and manipulating the results of the algorithms. Any social network group has its structure. For instance, in a company, a social network group composes of the presidents, the secretaries, the managers, the employees and etc. Identifying the role of each member in the group is of primary interest to any group analysis. In any form of communication, traces of personal styles can be identified by authorship analysis to automate artists style classification process [7] to discover intrusion detection [8], and to compress the data [9]. Yet, it was not clear if such stylistic difference can be discern between groups of communicators, based on the roles of sender and receiver of messages. To address this problem, we used Recursive Data Mining (RDM) for distinguishing the roles of the communicators in a social group. In general, RDM discovers, in a recursive manner, statistically significant patterns in a stream of data. The key properties of the pattern discovery in RDM include: (i) no restriction of the size of gaps between patterns, (ii) recursive mining in which discovered patterns are replaced by the new token and the mining is repeated on the newly created string, (iii) tolerance to imperfect matching. 2. SIGHTS The three main modules of SIGHTS (see Figure 1) are: Data Collection/Storage, Data Learning and Analysis; and Data Sources Store Semantic Graph and MetaData Semantic Graph Interactive Visualization and MetaData Derived Data Database Blogs ChatroomsInternet/ Live Journal Email Data Real Time Clustering Size vs Density Leader Identification Graph of Clusters Topic Identification Group Persistence Cycle Group Analysis Group Evolution Stream Group Analysis Algorithm Modules Fig. 1. SIGHTS System Architecture(currently the link from Chatrooms is not functional) Knowledge Extraction/Visualization. The Data Collection Modules operate on semantic graphs. The graphs are constructed by adding a node for each social network actor and a directed edge from sender’s node to a recipient’s node. The edges are marked with the time of the communication and, possible, other labels. Some edge labels are only appropriate for specified types of graphs. The user may have communication data existing in a variety of formats. SIGHTS handles the stand-alone input of a reasonable range of these formats in order to facilitate the introduction of new data into the program. Among these is a plain-text XML format which is well-documented. SIGHTS is also able to read from a database that is constructed according to specified guidelines. Blogs data is collected from LiveJournal.com blogs service provider. The semantic graph is constructed by creating a node for each blogger and the edge between any pair of bloggers who participated in the discussion in the comments of a post. Blogs collector monitors LiveJournal.com update feed and records the permanent address of the post. Two weeks after the date of the initial post, the blogs collector visits the page of the post and collects the thread of comments using the screen-scraping techniques. Blogs collector allows the establishment of “interest filters” that can narrow the data collection to posts on a certain topic. Blogs collector provides the interface for the analyst to tag posts as interesting and not interesting that will create the training set for the interest learning program. This information is also stored in the database and is accessible to other modules of the application. The Data Processing Modules process the semantic graph and meta data obtained during the data collection to retrieve additional information. They run in parallel with data F A C E G D H I B J A C E G B F D I K C BD M LF J Sept. 2000 -Sept. 2001 Mar. 2000 -Mar. 2001 Sept. 2001 -Sept. 2002 Fig. 2. Evolution of a part of the Enron organizational structure from 2000 -2002. All three graphs represents frequent structures; the labels indicate actors of the Enron community. Note: that B,C, D, F are present in all three time-intervals. collection and the identified groups and other data are stored in the database. Using visualization module user may view the data. Temporal group algorithms identify hidden groups in the stream of communications for the user specified time scales [10], [11]. Noteworthy among them are the cycle model algorithms that identify all groups of users who persistently communicate over a time interval and the stream model algorithm that finds groups based on frequently communicating triples followed by merging correlated triples. Such groups usually communicate over multiple time intervals during the period of interest in a streaming fashion. Our algorithms also give the group structure hierarchy and can be modified to track evolution. An example of the evolution of a group found in the ENRON email data set is shown in the Figure 2. The user may cluster a specific set of actors using communications collected over a specific interval of time. The details of this algorithm are presented in [12], [13]. SIGHTS then provides the user with an intuitive way to interact with the results, as shown in Figure 3. The Opposition Identification Module identifies the positive and negative sentiments between pairs of bloggers based on the length and average size of the messages in the conversations that took place among them. The threads of comments that appear on LiveJournal.com are split into conversation between pairs of bloggers. The module employs the Support Vector Machine classifier that was trained using a data set that was manually created to determine the oppositions between bloggers using the length of the conversation and the average length of the message in the conversation to determine whether bloggers opposed each other in a given conversation. 3. Interactive Visualizations in SIGHTS SIGHTS provides a GUI for interactive visualization of the results obtained from data collection and data processing mod Fig. 3. Size vs. Density Plot of SIGHTS Sample Analysis (Each dot represents a group, bold dots are groups of high interest level with high amount of communication) ules. Currently system supports four different visualizations for the analyst: Size vs. Density plot, graph of Clusters plot, Group Persistence view and Group evolution view. The interactive Graph of Overlapping Clusters displays the groups discovered in the time range determined by the range selector bars. In this view each grey dot represents an actor and grey links represent the background communications between actors if such exist. Every group is denoted as a green square, and the links from the green square to grey dots show which actors are members of this group. It is also possible to visualize group persistence. Each time cycle has a number of rectangles which belong to this cycle and represent groups. When such group is selected, its members will be shown in the “Selected Group” display. The selected group is entirely colored blue while other groups can be either colored grey, which means they do not have any members in common with selected group, or a group can be partially/entirely colored green or blue, which respectively means that it contains some but not all of the members from the selected group, or has all of the members of the selected group as well as in both cases (green or blue) groups can possibly have some other actors in them. Leaders and Group evolution view displays leadership information and group members. Clicking on a coalition will find it’s evolution. The currently selected group will be outlined in blue. Coalitions that acted as ancestors to the selected group will be outlined in green if they were discovered in a cycle that is currently displayed in the plot. Descendants of a selected coalition will be outlined in black if they were discovered in cycles currently displayed in the plot. 4. Recursive Data Mining Members of a social group or an organization can usually be partitioned hierarchically into subgroups. The lowest level of such partitioning would contain basic or elementary members, while the highest would identify leader(s). Such hierarchies are amenable for efficient processing via Recursive Data Mining (RDM) [14], a new approach for identifying patterns in sequences of tokens. In the RDM process, the recursive (hierarchical) search for repeating patterns captures distinctive features at varying levels of abstraction. Intuitively, at lower levels the patterns obtained are more specific, resulting limiting the potential for valid matches (M). At higher levels, the patterns are more general, increasing the chances for a match. On the other hand, with increasing levels, the number of patterns found decreases monotonically. In this paper, we present the significant improvement on RDM compared to the previous version [14], using an ensemble of classifiers, each for different level of RDM abstraction, instead of having only one classifier for the entire process. The overall RDM process is outlined in Algorithm 1. Algorithm 1 Recursive Data Mining Require: Set of sequences S Ensure: Sets of patterns (features) L, one for each level 1: L ={} 2: i = 0 3: S = S {Level-0} 4: repeat 5: PALL = pattern_generation(S) 6: PSIG = sig_patterns(S, PALL) 7: D = get_domi_patterns(S,PSIG) 8: Li = D 9: i++ 10: S = make_next_level(S,D) {Level-i} 11: until M == ∅ ∨i == max_level 12: return L The process starts with a sliding window of predefined length passing over the input sequence one token at a time. At each stop, patterns with all possible combinations of tokens and gaps are recorded. When pass is completed, the recorded patterns are checked for frequency of their occurrence. Some patterns could be either too specific to a certain text or insignificant because they contain very commonly occurring words. In either case, they are ineffective in classifying the mined text while adding to the computational cost of the algorithm. The “usefulness" of a pattern is computed via a statistical significance test. A pattern is deemed significant if its frequency of occurrence (based on a unigram model) is larger than the expected number of occurrence in a random string. Patterns that are deemed insignificant are eliminated from further consideration. At each position in the sequence, the tokens in the significant patterns are matched against the tokens in the sequence. The 80 9 80 2 81 45 33 14 10 25 61 21,6,−,− 2 2 10,21,6,14 21,6,14,− 10,21,6,− 6,14,−,8 2 ........ 3 2 Pattern Count Pat. Id 2 75 10,21,6,14,−,8 10,21,6,−,− 80 811 ........ 77 78 79 76 Level−1 sequence 21 6 14 10 21 6 14 10 21 15 45 33 14 10 25 6110 8 86867 9 2 Level−0 sequence 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 Fig. 4. Sequence Re-writing Step pattern that has the highest matching score starting at location j in the input sequence is termed as the dominant pattern starting at position j. The term dominant pattern is coined from the fact that this pattern dominates over all other significant patterns for this position in the sequence. The matching score is calculated as follows:   1 if Pi = Sv j score(Pi, Sv j)= ǫ if Pi = GAP,ǫ< 1  0 Pi 6= Sv j where Pi is the ith token of the pattern and j is an index over sequence S. ǫ is intended to capture the notion that a match with a gap is not as good as an exact match but much better than a mismatch. Dominant patterns at consecutive positions over the sequence can be merged to form longer dominant patterns. A new token is assigned to each dominant pattern. During the second pass of the sequence at level v, the sequence for level v + 1 is formed. The sequence corresponding to a dominant pattern is replaced by the new token for this dominant pattern. Unmatched tokens are copied from sequence Sv to the new sequence Sv+1. Figure 4 illustrates this step as well as the merging process. The training phase uses the dominant patterns generated at each level to construct an ensemble of classifiers, one for each level. The classifiers can be created from any machine learning method, such as Naïve Bayes or Support Vector Machine. Given a set of input sequences, along with the class labels, dominant patterns are generated for each label starting at level 0 to level max_level. The union of all dominant patterns at a level v across all input sequences forms the set of feature for classifier at level v. For the ensemble of classifiers, the final posterior class probability is the weighted sum of the class probabilities of individual classifiers. Each classifier is assigned a weight that reflects the confidence of the classifier. The original input sequences are further split into a new training set and a tuning set. Each classifier in the ensemble trains its model based on the new training set. The confidence value of classifier at level v, conf (Cv), is defined as the relative accuracy on the tuning set. The testing phase follows the training phase in terms of the level by level operating strategy. Initially, the frequency of level 0 dominant patterns is counted over the level 0 test sequence. This vector of frequencies forms the feature vector at level 0. Once the level v is done, the next level sequence is generated following the rewriting rules for dominant patterns. This process continues till all levels of dominant patterns are exhausted. Each classifier in the ensemble classifies the test data and the final probability of class C given a sequence x is assigned based on the following weighing scheme max_levels P(C |x)= conf (Ci)× PCi (C |x) i=1 where x is a test sequence and PCi (C | x) is the posterior probability assigned by classifier Ci. 5. RDM Experiments and Results We compare performance of RDM with performances of classifiers based on, respectively Naïve Bayes, Support Vector Machines and Predictive Association Rules (CPAR) [15] that combines the advantages of associative and traditional rule- based classifiers. Support Vector Machines based classifiers performed well on text classification tasks [16]. Since RDM does not use any semantic tools (part-of-speech tagging or synonym groups) in extracting patterns, we compare it with techniques that also do not utilize domain or semantic knowledge. email. No efforts were made to correct spelling errors or to expand abbreviations in attempt to reduce the noise. Our experiments were done on the March 2, 2004 0.5 version of Enron dataset, distributed by William Cohen, http://www.cs.cmu.edu/˜enron/. The dataset was 0.4 cleaned to eliminate attachments, quoted text and tables from the body of the email messages and header fields from the 0.3 For our purpose of identifying roles, employees were partitioned into groups based on their organizational role in Enron, as suggested in [17]. Only the roles CEO, Manager, Trader and Vice-president were used in our experiments thanks to a large number of employees playing these roles. Since we are concerned with identifying roles based on the sent messages we only deal with the messages in the Sent folder of each participant. For each of these roles, the emails are divided into two sets as summarized in Table I. TABLE I Dataset for Role Identification Training Testing Total # Sent Set Set folders CEO 1010 250 1260 3 Manager 1403 349 1752 4 Trader 654 162 816 4 VP 1316 327 1643 4 Total 4383 1088 5471 15 0.5 0.2 0.1 0 Naive Bayes RDM with NB SVM RDM with SVM CPAR CEO Manager Trader Vice-president Fig. 6. Multi-class Classification – F-measure in performances. In terms of the F-measure, RDM with NB and SVM performed better than or as good as NB and SVM respectively, for all cases. RDM is better than CPAR under all settings. 1-Fm The second set of results compares the performance under the multi-class classification test, where the task is to answer “Which is the most likely role, out of roles R1,..., Rn, for sender of message m?" For NB and RDM with NB, the training data is split into 4 groups and probabilities computed for each of the roles. For SVM and RDM with SVM, four sets of datasets are generated, one each for role [r]-[R\r] pairs. The results are shown in figure 6. 0.2 0.1 0 CEO Manager Trader Vice-president 1-Fm Naive Bayes RDM with NB SVM RDM with SVM CPAR Fig. 5. Binary Classification – F-measure 0.4 0.3 To further investigate the results obtained for the multi- class scenario, we performed the paired t-test for statistical significance. The paired t-test provides a hypothesis test of the difference between population means for a pair of ran dom samples whose differences are approximately normally distributed. The null hypothesis H0 states that the difference in performance between RDM and the other methods is not significant. The alternate hypothesis states otherwise. A 20fold cross validation was performed on the data. The accuracy results obtained therein are used for the t-test. The results are shown in table II. Based on the p-value in Table II we reject the null hypothesis, indicating a definite improvement with RDM. The confidence interval for the mean difference shows that the improvement lies between 1.8% and 3% as compared to NB whereas as compared to SVM (and CPAR) it lies between 8% We used SVMLight to implement SVM and IlliMine package to implement CPAR. RDM was used both with Naïve Bayes, and SVM in ensemble of classifiers. In the binary test, given a test message m, the task is to answer “Is message m sent by a person with role r"? where r ∈ R ={CEO, Manager, Trader, Vice − president}. To compare performance of the classifiers, we used F-measure that is the harmonic mean of precision and recall, so higher F-measure, better the performance. The performance for the compared classifiers is shown in Figure 5, where the values of 1 -F-measure are presented to highlight the differences and 10%. TABLE II Results of paired t-test Classifier Pair p-value 95% confidence interval NB vs RDM 1.23E-08 (0.0186 -0.0292) SVM vs RDM 1.94E-14 (0.0818 -0.0984) CPAR vs RDM 3.74E-13 (0.0821 -0.1045) 6. Conclusion We present here set of tools based on SIGHTS and RDM. SIGHTS discovers and analyzes hidden social groups on the internet and in cyberspace and then monitors their evolution. It includes the user friendly visualization tool. Examples of SIGHTS application to social groups evident in Enron-email and Blog data are given. A complementary tool, RDM provides a general framework for identifying hierarchical patterns in a sequence of tokens. RDM was applied to role recognition from the Enron dataset email. Currently, we are integrating our algorithms with BlackBook-2 [18] which is a software system that integrates the data sources and enables interactive execution of its algorithms. It also offers extended visualization capabilities which we plan to leverage and enhance. Acknowledgment This work was partially supported by the ONR Contract N00014-06-1-0466 and also by the U.S. National Science Foundation (NSF) under Grants IIS-0634875, IIS-0621303, IIS-0522672, ITR 0324947 and by the U.S. Department of Homeland Security (DHS) through the Center for Dynamic Data Analysis for Homeland Security administered through ONR grant number N00014-07-1-0150 to Rutgers University. The content of this paper does not necessarily reflect the position or policy of the U.S. Government, no official endorsement should be inferred or implied. References [1] M. E. Newman, “The structure and function of complex networks,” SIAM Review, 45(2): 167–256, 2003. [2] T. Berger-Wolf and J. Saia, “A framework for analysis of dynamic social networks,” DIMACS Technical Report, vol. 28, 2005. [3] J. Sinai, “Combating terrorism insurgency resolution software,” Proc. IEEE Int. Conf. Intelligence and Security Informatics(ISI-2006), pp. 401–406. [4] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan, “Group formation in large social networks: membership, growth, and evolution,” Proc. 12th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2006, pp. 44–54. [5] D. Kempe, J. M. Kleinberg, and E. Tardos, “Influential nodes in a diffusion model for social networks.” Proc. ICALP, 2005, pp. 1127–1138. [6] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2003, pp. 137–146. [7] T. Li and M. Ogihara, “Music artist style identification by semi-supervised learning from both lyrics and content,” Proc. 12th Annual ACM Int. Conf. Multimedia, 2004. [8] W. Lee and S. J. Stolfo, “Data mining approaches for intrusion detection,” Proc. 7th USENIX Security Symposium, 1998, pp. 79–93. [9] K. Sripanidkulchai, B. Maggs, and H. Zhang, “Efficient content location using interest-based locality in peer-topeer systems,” Proc. IEEE INFOCOM, 2003. [10] J. Baumes, M. Goldberg, M. Hayvanovych, M. Magdon- Ismail, and W. A. Wallace, “Finding hidden groups in a stream of communications,” Proc. IEEE Int. Conf. Intelligence and Security Informatics (ISI-2006), 2006, pp. 201–212. [11] A. Camptepe, M. Goldberg, M. Magdon-Ismail, and M. Krishnamoorthy, “Detecting conversing groups of chatters: a model, algorithms and tests,” Proc. IADIS Int. Conf. Applied Computing 2005, pp. 145–157. [12] J. Baumes, M. Goldberg, and M. Magdon-Ismail, “Efficient identification of overlapping communities,” Proc. IEEE Int. Conf. Intelligence and Security Informatics, 2005, pp. 27–36. [13] J. Baumes, M. Goldberg, M. Krishnamoorthy, M. Magdon-Ismail, and N. Preston, “Finding comminities by clustering a graph into overlapping subgraphs,” Proc. IADIS Int. Conf. Applied Computing 2005, pp. 97–104. [14] B. Szymanski and Y. Zhang, “Recursive data mining for masquerade detection and author identification,” Proc. 5th IEEE System, Man and Cybernetics Information Assurance Workshop, West Point, NY, June 2004, pp. 424–431. [15] X. Yin and J. Han, “Cpar: Classification based on predictive association rules,” Proc. SIAM Int. Conf. Data Mining, 2003. [16] T. Joachims, “Text categorization with support vector machines: learning with many relevant features,” Proc. 10th European Conf. Machine Learning, 1998, pp. 137– 142. [17] see http://rabasrv.jhuapl.edu/karma/index.php/Main_Page [18] see http://rabasrv.jhuapl.edu/karma/index.php/Main_Page (a password is required to access the page).