This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/.

This program is an implementation of the algorithms described in "Extracting Hidden Groups and their Structure from Streaming Interaction Data" by Mark K. Goldberg, Mykola Hayvanovych, Malik Magdon-Ismail, and William A. Wallace.

This program may be used to find chain and sibling communication triples in interaction data. Two sub-programs may be run with it:

  1. The first determines how frequently a triple must occur in order to be considered significant (the "significance threshold"). It does this by generating lists of random communications based on the original data and counting the triples that occur in these lists.

  2. The second separates the triples into cliques and may be used to find larger communication structures in the data.

If you have found a bug or have a problem, please contact the programmer:
  Ben Caulfield
  caulfb2@rpi.edu

------------------How to Compile---------------------

To compile, type "make" (without quotes) in the Hidden_Groups directory.

-----------------How to Run--------------------------

To run the program, type:

  ./handle_data.exe input_file triple_type [minimum maximum delta] cliques_or_freq [synthetic_repeats std_dev_factor user_significance_threshold]

input_file: a file containing a list of communications, where each communication is of the form "sender receiver time" (no quotes). Here, 'sender' and 'receiver' are integers (which should be as small as possible) and 'time' is a double.

triple_type: type either "chain", "sibling", or "both" to determine what type of triples should be used.
    if "chain": the next two values should be 'minimum' and 'maximum'
    if "sibling": the next value should be 'delta'
    if "both": the next three values should be 'minimum', 'maximum', and 'delta'

minimum: a double representing the minimum amount of time that must pass between two communications for them to be considered a chain.

maximum: a double representing the maximum amount of time that can pass between two communications for them to be considered a chain.

delta: a double representing the maximum amount of time that can pass between two messages from the same sender for them to be considered a sibling.

cliques_or_freq: type either "frequency", "cliques", or "both" to determine which sub-programs should be run. "frequency" finds the significance threshold; "cliques" finds the cliques for significant triples.
    if "frequency" or "both": the next two inputs should be 'synthetic_repeats' and 'std_dev_factor'
    if "cliques": the next input should be an integer to use as the significance threshold ('user_significance_threshold'). Keep in mind that a low significance threshold will greatly increase the necessary computation time, and the program may not even be able to complete the computation.

synthetic_repeats: an integer representing the number of random communication lists that will be created to determine the frequency at which triples should be considered significant (the "significance threshold").

std_dev_factor: an integer used in calculating the significance threshold.

user_significance_threshold: an integer representing the minimum number of times a triple must occur for it to be used in the clustering.

----------------------Testing------------------------------

Test 1: execute the following command:

  ./handle_data.exe test_input.txt both 0.01 0.03 0.02 frequency 3 3

The contents of handle_data_output.txt should be similar to test1_frequency_output.txt. Keep in mind that there are random factors involved in this test, so the outputs may not be identical.

Test 2: execute the following command:

  ./handle_data.exe test_input.txt chain 0.01 0.03 cliques 3

The contents of clustered_graph.txt should be the same as test2_cluster_output.txt.

-----------------------Output------------------------------

The program writes information on the frequencies of triples to "handle_data_output.txt". The table is presented in order of increasing frequency, so that for any frequency f, the value under "actual_data" states the number of distinct triples that occurred at least f times. If the significance-threshold sub-program is run, each row also includes the average number of synthetically created triples that occurred at least f times, and the standard deviation of those occurrences.

If the clustering sub-program is run, information on the clusters is written to "clustered_graph.txt". Each line in this file contains a cluster of nodes of the form "a.b.c.t". Here, 'a', 'b', and 'c' are the communicators in the triple, and 't' is a bool that equals 1 if the triple is a chain and 0 if it is a sibling. If the triple is a chain, then there were communications from a to b and from b to c. If it is a sibling, there were communications from a to b and from a to c.

-----------Other Alterations----------------------------

If you would like to change the information or style of the output file, go to the end of the function 'Handle_Data' in the file 'Handle_Data.cpp'.

If you would like to change the way the pseudo-random data is formed when determining the significance threshold, go to constrained_random_data.cpp.
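
-----------Example: Reading a Cluster Node--------------

As an illustration of the "a.b.c.t" node format described under Output, the following C++ sketch parses a single node and lists the two communications it implies. It is not part of the program: the names 'Triple', 'parse_triple', and 'implied_edges' are hypothetical, and the sketch assumes that 'a', 'b', and 'c' are written as plain integers and that 't' is written as 0 or 1.

```cpp
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// One node from clustered_graph.txt, written in the file as "a.b.c.t".
struct Triple {
    int a;
    int b;
    int c;
    bool is_chain;  // t == 1: chain, t == 0: sibling
};

// Hypothetical helper (not part of handle_data): parses one "a.b.c.t" token.
// Returns false if the token does not match the assumed format.
bool parse_triple(const std::string& token, Triple& out) {
    int a = 0, b = 0, c = 0, t = 0;
    int consumed = 0;
    // %n records how many characters were read, so trailing junk is rejected.
    if (std::sscanf(token.c_str(), "%d.%d.%d.%d%n", &a, &b, &c, &t, &consumed) != 4) {
        return false;
    }
    if (consumed != static_cast<int>(token.size()) || (t != 0 && t != 1)) {
        return false;
    }
    out = Triple{a, b, c, t == 1};
    return true;
}

// The two directed communications (sender, receiver) implied by a triple:
// a chain is a -> b and b -> c; a sibling is a -> b and a -> c.
std::vector<std::pair<int, int>> implied_edges(const Triple& t) {
    if (t.is_chain) {
        return {{t.a, t.b}, {t.b, t.c}};
    }
    return {{t.a, t.b}, {t.a, t.c}};
}
```

For instance, parsing the node "4.7.9.1" gives a chain whose implied communications are 4 to 7 and 7 to 9, while "4.7.9.0" gives a sibling whose implied communications are 4 to 7 and 4 to 9.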