README for CAVES Project

This README file has four parts. The first part describes the basic function
of each file. The second part shows how to compile and run the project. The
third part explains how to set up a configuration file for running tests.
Finally, the fourth part describes how to run the TPC-H query generator.

For a description of the system architecture and details about TPC-H, please
refer to the text (chapters and appendix) of the thesis.

================================================================================

1. File descriptions

The main files in this project have filenames that start with "caves-". Files
whose names start with "ross" or "tw" are system files from the ROSS (Time
Warp) project and are not described here. For more information about the ROSS
project and the Time Warp system, please refer to
http://www.cs.rpi.edu/~chrisc/.

Here are basic descriptions of the main files in the CAVES project.

Main files:

caves-init.c   - initializes the CAVES system: reading log files, setting up
                 query types, allocating memory blocks, initializing the
                 cache, etc.
caves-main.c   - the main program of the CAVES system. It handles all the
                 Time Warp events and the final output.
caves-policy.c - defines caching policies (weighted formula) for the CAVES
                 system.

Secondary files:

caves-types.h  - defines data types used in the CAVES project, for example
                 views, statistics, database servers, etc.
caves-cache.c  - handles cache activity, such as enqueue, dequeue, hit, miss,
                 etc.
caves-sort.c   - functions that handle sorting cache blocks.
caves-view.c   - a single function that returns the specified view.
caves.h        - a single header file that includes all header files used in
                 the project.
caves-extern.h - declares external variables and functions to be used in the
                 program.
caves-global.c - declares global variables that are used in multiple program
                 files.
caves-sd.c     - handles standard deviation computation.
caves-block.c  - for testing only: prints block ids to make sure the program
                 works correctly.

Usually, only the three main files need to be tuned to fit a testing
scenario. For example, changing the caching policy requires modifying
caves-policy.c; changing the output data requires modifying caves-main.c;
changing the data sources requires modifying caves-init.c.

In some cases, additional data types or functions can be added to the system
to adjust the program behavior. In this case, data types should be declared in
caves-types.h. If a global variable is necessary, it should be declared in
caves-global.c and made an external variable in caves-extern.h.

All other files should not need any attention unless the fundamental behavior
of the system is changed.

================================================================================

2. Compilation and Running

2.1 Compilation

The CAVES project (simulation code) was developed on the "Rensselaer High
Performance Computing Cluster". For more information about the configuration
of the cluster, see http://hd-03.cipr.rpi.edu/equipment.html.

To compile the project, run the following command:

    make caves

To clean all compiled components in the project, run:

    make clean

It is suggested that this project be run on Linux machines. To port it to
other types of UNIX systems (e.g. Solaris or BSD), some changes must be made
to the Makefile.

2.2 Running

To run the simulation program, a configuration file needs to be provided.
Suppose the configuration file is called "test"; then run the following
command:

    ./caves test > output.test.txt

It is recommended that the output be saved to a text file for further
analysis.

================================================================================

3. Configuration file

To set up a test, one must create a configuration file for the simulation
program.
The following is an example of a configuration file, "sample-config":

    86400000 //SimulationTime
    1000 //ReorderMeanTime
    10 //QueryRequestMeanTime
    5 //NumberOfClients
    10000 //changeTurnAroundMeanTime
    1600 //cacheSize
    1 //numberOfDBServers
    10 //kSize
    0 //logOnOff
    0 //dynamicOnOff
    0 //knap
    500 //windowSize
    0 //entry
    0 //remove
    //Weights
    0.300000
    0.000000
    0.700000
    0.000000
    0.000000
    0.000000
    0.000000
    //DB_S_MinNetworkSpeed
    //DB_S_MaxNetworkSpeed
    //DB_S_MeanNetworkSpeed
    //DB-S_SdNetworkSpeed
    //DB_S_DiskSpeed
    0.001000
    10
    1
    0
    25
    0
    0
    1
    50 //Diskspeed
    15 //K
    1 //NumberofTypes
    1 //stuff
    //NumberOfViewForType
    //#
    //distribution_infor
    //MinViewSize
    //MaxViewSize
    //MinViewFF
    //MaxViewFF
    //MeanViewCC
    //MaxViewCC
    10000
    100
    unif
    20
    30
    1
    2
    200
    300

For most lines, the first field is the value of a parameter and the second
field is a comment. The comment must still be kept, because caves-init.c also
reads the second field.

Unless stated otherwise, the time unit in the CAVES system (and in the
configuration file) is 10 ms. Sizes are always in K bytes.

Here are the descriptions of each line in the configuration file:

1st line: simulation time in units of 10 ms. Thus, for a 10-day simulation,
    the value should be 86400000.
2nd line: the mean time between cache re-sorts. Although the cache is a
    priority queue sorted by the time of cache activities, the value may
    change because of system (environment) variables, so the cache must be
    re-sorted periodically. This setting also affects the frequency of output.
3rd line: query request mean time
4th line: number of clients sending query requests
5th line: the duration before the Turn Around Mean Time changes
6th line: size of the cache in K bytes
7th line: number of database servers
8th line: k Size
9th line: turns log output on/off
10th line: turns dynamic changing of the caching policy on/off
11th line: turns knapsack problem handling on/off
12th line: the time window size. The size is based on the number of cache
    re-sorts.
    A value of 500 means statistics will be printed every 500 cache re-sorts.
13th line: turns cache entry criteria on/off
14th line: turns remove on/off
15th line: comment only
16th-22nd lines: weight combination of the 7 caching policies. The sum of the
    7 values must be 1.
23rd-27th lines: comment only (describe the format of a database server)
28th line: defines the minimum network speed of a database server
29th line: defines the maximum network speed of a database server
30th line: defines the mean network speed of a database server
31st line: defines the standard deviation of the network speed of a database
    server
32nd-35th lines: define the disk speed, dl1, dl2, and threshold of a database
    server
36th line: defines the disk speed of the cache server
37th line: defines the value K. K is the number of most recent records that
    are kept in the program for time-dependent statistics, such as hits.
38th line: number of query types
39th line: number of workloads
40th-48th lines: comment only (describe the format of data type)
49th line: number of different views in this query type
50th line: the percentage of this query type in the workload
51st line: the type of query distribution. Values can be:
    unif means uniform distribution
    norm means normal distribution*
    zipf means zipf distribution
    tpch means tpch distribution**
52nd line*: the minimum size of a view
53rd line*: the maximum size of a view
54th line*: the minimum value of the filter factor
55th line*: the maximum value of the filter factor
56th line*: the minimum value of the computation cost for a view
57th line*: the maximum value of the computation cost for a view

*. For the norm distribution, the 52nd line should be a value that represents
the shape of the normal distribution. For example, for a normal-8 distribution
where the value of the 49th line is 10000, the value should be 800. For more
detail, please refer to the thesis. Also, the lines described above as
52nd-57th become the 53rd-58th lines in this case.

**. Only when tpch is implemented.
In this mode, the next 6 lines have no effect, since the distribution is
predefined based on TPC-H. The main difference in the tpch implementation is
in caves-init.c. It defines all the view sizes and computation costs in two
arrays. For more information about this setting, please refer to the thesis.

To facilitate tests, one may use a program to generate many configuration
files in a short time (such as generator_v3.c) and a program to collect the
output data into a single file (such as create_output_v3.c) for analysis.
Using a script to run these tests also simplifies the task. For example, the
script testrun-1 may have the following contents:

    ./caves 1 > 1.out
    ./caves 2 > 2.out
    ./caves 3 > 3.out
    ./caves 4 > 4.out
    ./caves 5 > 5.out
    ...
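
The hand-written list in testrun-1 can also be expressed as a loop. The
following is only a sketch, assuming the configuration files are named
1, 2, 3, ... as in the example above. The function name run_caves_tests and
the CAVES_BIN variable are hypothetical and are not part of the CAVES sources.

```shell
#!/bin/sh
# Hypothetical helper (not part of CAVES): run the simulator once for each
# configuration file given as an argument and save the output to <config>.out.
# CAVES_BIN is an assumed override for the simulator path; default is ./caves.
run_caves_tests() {
    bin=${CAVES_BIN:-./caves}
    for cfg in "$@"; do
        # Same command form as in testrun-1: ./caves N > N.out
        "$bin" "$cfg" > "$cfg.out" || echo "run failed for config $cfg" >&2
    done
}
```

Calling "run_caves_tests 1 2 3 4 5" should then be equivalent to the first
five lines of testrun-1, with a message on standard error if a run exits with
a non-zero status.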
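
Section 3 requires the seven policy weights on the 16th-22nd lines of a
configuration file to sum to 1. Checking this before a long run may catch a
mistyped weight early. The following is a sketch only: check_weights is a
hypothetical helper, the 0.000001 tolerance is an assumption, and the check
relies on the line layout described in Section 3.

```shell
#!/bin/sh
# Hypothetical helper (not part of CAVES): succeed only if the first fields
# of lines 16-22 of the given configuration file sum to 1.
check_weights() {
    awk 'NR >= 16 && NR <= 22 { sum += $1 }
         END {
             diff = sum - 1
             if (diff < 0) diff = -diff
             # Assumed tolerance for floating-point rounding.
             if (diff > 0.000001) exit 1
         }' "$1"
}
```

For example, "check_weights sample-config && ./caves sample-config" would
start the simulation only when the weights add up.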