A Sample Parser for GML

Marcus Raitner
Lehrstuhl für Theoretische Informatik
Universität Passau
94032 Passau
raitner@fmi.uni-passau.de
Michael Himsolt
Lehrstuhl für Theoretische Informatik
Universität Passau
94032 Passau
himsolt@fmi.uni-passau.de

Table of Contents

  1. Introduction
  2. Scanner
    1. GML_scanner procedure
    2. GML_token structure
    3. GML_value enumeration (scanner related entries)
    4. GML_tok_val structure
  3. Parser
    1. GML_parser procedure
    2. GML_stat structure
    3. GML_pair structure
    4. GML_value enumeration (parser related entries)
    5. GML_pair_val structure
    6. GML_free_list procedure
    7. GML_print_list procedure
  4. Error Handling
    1. GML_init procedure
    2. GML_error structure
    3. GML_error_value enumeration
  5. Examples
    1. gml_demo program

Introduction

GML, the Graph Modelling Language, is a portable file format for graphs. GML has been developed as part of the Graphlet system, and has been implemented in several other systems, including LEDA, GraVis and VGJ.

This document describes a sample scanner and parser for GML. Unlike other implementations, this one uses ANSI C and does not rely on external tools such as lex and yacc. This implementation is also designed to be highly portable and can be used as a library.


Scanner

The procedure GML_scanner implements the scanner for GML files:

struct GML_token GML_scanner (FILE* file);

GML_scanner reads the next input token from file and returns it in a GML_token structure. file must be open for read access; the caller is responsible for opening and closing the file.

The type GML_token is defined as follows:

struct GML_token { 
    GML_value kind;
    union GML_tok_val value;
};

Here kind determines the type of the token and value holds its value. kind is of type GML_value, whose scanner related entries are listed in Table 1.

Kind            Meaning                             Data
GML_KEY         token is the name of a key          value.string
GML_INT         token is an integer number          value.integer
GML_DOUBLE      token is a floating point number    value.floating
GML_STRING      token is a string                   value.string
GML_L_BRACKET   token is a left bracket             none
GML_R_BRACKET   token is a right bracket            none
GML_END         EOF was reached                     none (value undefined)
GML_ERROR       error occurred while parsing        value.err

Table 1. Enumeration GML_value, scanner related entries

The value field in GML_token is of type GML_tok_val, which is defined as follows:

union GML_tok_val {
    long integer;          // used with GML_INT
    double floating;       // used with GML_DOUBLE
    char* string;          // used with GML_STRING, GML_KEY
    struct GML_error err;  // used with GML_ERROR
};
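
As an illustration of how kind selects the valid member of the union, the following sketch formats a token as text. The type definitions are stand-ins that mirror the declarations in this document (real code would include the scanner header instead), the order of the enumerators is an assumption, and describe_token is a hypothetical helper that is not part of the library. It uses snprintf, so it needs a C99 compiler even though the library itself is ANSI C.

```c
#include <stdio.h>

/* Stand-ins mirroring the declarations in this document; real code would
   include the scanner header instead. Enumerator order is an assumption. */
typedef enum {
    GML_UNEXPECTED, GML_SYNTAX, GML_PREMATURE_EOF, GML_TOO_MANY_DIGITS,
    GML_OPEN_BRACKET, GML_TOO_MANY_BRACKETS, GML_OK
} GML_error_value;

struct GML_error {
    GML_error_value err_num;
    int line;
    int column;
};

typedef enum {
    GML_KEY, GML_INT, GML_DOUBLE, GML_STRING,
    GML_L_BRACKET, GML_R_BRACKET, GML_END, GML_LIST, GML_ERROR
} GML_value;

union GML_tok_val {
    long integer;
    double floating;
    char* string;
    struct GML_error err;
};

struct GML_token {
    GML_value kind;
    union GML_tok_val value;
};

/* Hypothetical helper: writes a description of tok into buf, reading only
   the union member that matches tok->kind. Returns buf. */
char* describe_token (const struct GML_token* tok, char* buf, size_t size) {
    switch (tok->kind) {
    case GML_KEY:
        snprintf (buf, size, "KEY %s", tok->value.string);
        break;
    case GML_INT:
        snprintf (buf, size, "INT %ld", tok->value.integer);
        break;
    case GML_DOUBLE:
        snprintf (buf, size, "DOUBLE %g", tok->value.floating);
        break;
    case GML_STRING:
        snprintf (buf, size, "STRING \"%s\"", tok->value.string);
        break;
    case GML_L_BRACKET:
        snprintf (buf, size, "[");
        break;
    case GML_R_BRACKET:
        snprintf (buf, size, "]");
        break;
    case GML_END:
        snprintf (buf, size, "END");
        break;
    case GML_ERROR:
        snprintf (buf, size, "ERROR at line %d, column %d",
                  tok->value.err.line, tok->value.err.column);
        break;
    default:
        snprintf (buf, size, "UNKNOWN");
        break;
    }
    return buf;
}
```

A caller would typically invoke GML_scanner in a loop and pass each token to such a helper until kind is GML_END or GML_ERROR.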

Parser

The procedure GML_parser implements the parser for GML files:

struct GML_pair* GML_parser (FILE* file,
    struct GML_stat* stat,
    int mode);

Input parameters for GML_parser are the file, a pointer to a GML_stat structure and the operation mode. file must be open for reading; the caller is responsible for opening and closing the file. stat must point to a structure of type GML_stat, which is defined as follows:

struct GML_stat {
    struct GML_error err;
    struct GML_list_elem* key_list;
};

The field err is used to report errors (for information on GML_error see Error Handling below). If an error occurs during parsing, stat->err.err_num is set to the corresponding error code, and additional information is written into stat->err. If no error occurs, stat->err.err_num has the value GML_OK. The other field in GML_stat is key_list, a pointer to a singly linked list of the strings used for keys. The first key string is key_list->key, and the next node is key_list->next. As in the example at the end of this document, key_list is set to NULL before the first call. mode is almost always 0.

The parameter mode needs further clarification. GML_parser parses lists recursively. A closing square bracket (]) therefore means the end of a list in a recursive call, but a syntax error (GML_TOO_MANY_BRACKETS) at the top level. mode discriminates between the two cases: it is 0 at the top level and 1 in a recursion step.
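
The role of mode can be illustrated with a routine that is much simpler than the real parser. The following sketch only checks bracket nesting in a string; it is not the library's implementation. The names bracket_result, check and check_brackets are hypothetical, and brackets inside quoted strings are not treated specially.

```c
/* Hypothetical result codes for this sketch only. */
enum bracket_result { BR_OK, BR_TOO_MANY_BRACKETS, BR_OPEN_BRACKET };

/* Scans the text at *p and advances *p past what it consumed.
   mode is 0 at the top level and 1 inside a list. */
static enum bracket_result check (const char** p, int mode) {
    while (**p != '\0') {
        char c = *(*p)++;
        if (c == '[') {
            /* A list starts: handle it in a recursion step. */
            enum bracket_result r = check (p, 1);
            if (r != BR_OK) return r;
        } else if (c == ']') {
            /* End of the list in a recursion step, error at top level. */
            return mode == 1 ? BR_OK : BR_TOO_MANY_BRACKETS;
        }
    }
    /* End of input: fine at the top level, an unclosed list otherwise. */
    return mode == 0 ? BR_OK : BR_OPEN_BRACKET;
}

/* Entry point: callers always start at the top level (mode 0). */
enum bracket_result check_brackets (const char* text) {
    return check (&text, 0);
}
```

The same closing bracket is accepted or rejected depending only on mode, which is why an outside caller passes 0 and only the recursive calls pass 1.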

GML_parser returns a pointer to a structure of type GML_pair, which is defined as follows:

struct GML_pair {
    char* key;
    GML_value kind;
    union GML_pair_val value;
    struct GML_pair* next;
};

Each GML_pair structure corresponds to a key-value pair in the GML file, where key is a pointer to the key, and kind and value hold the value. For example, the sequence "id 42" translates into a GML_pair structure where key is "id", kind is GML_INT and value.integer is 42. The field next implements GML lists: it points to the next element of the current list, or is NULL if there are no more elements in the list.

The field kind determines which of the fields in value is used. kind is of type GML_value, which is listed in Table 2.

Kind         Meaning                               Data
GML_INT      value is an integer number            value.integer
GML_DOUBLE   value is a floating point number      value.floating
GML_STRING   value is a string                     value.string
GML_LIST     value is a list of key-value pairs    value.list

Table 2. Enumeration GML_value, parser related entries

The data structure GML_pair_val is defined as follows:

union GML_pair_val {
    long integer;          // kind is GML_INT
    double floating;       // kind is GML_DOUBLE
    char* string;          // kind is GML_STRING
    struct GML_pair* list; // kind is GML_LIST
};
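
To show how next and value.list are followed in practice, the sketch below counts the node entries of a graph. The type definitions are stand-ins that mirror the declarations in this document (real code would include the parser header instead), and the order of the enumerators is an assumption. The helpers find_pair, count_lists and count_nodes are hypothetical. The sketch also assumes the usual GML layout, in which a top-level "graph" list contains one "node" list per node.

```c
#include <stddef.h>
#include <string.h>

/* Stand-ins mirroring the declarations in this document; real code would
   include the parser header instead. Enumerator order is an assumption. */
typedef enum {
    GML_KEY, GML_INT, GML_DOUBLE, GML_STRING,
    GML_L_BRACKET, GML_R_BRACKET, GML_END, GML_LIST, GML_ERROR
} GML_value;

struct GML_pair;

union GML_pair_val {
    long integer;
    double floating;
    char* string;
    struct GML_pair* list;
};

struct GML_pair {
    char* key;
    GML_value kind;
    union GML_pair_val value;
    struct GML_pair* next;
};

/* Returns the first pair named key on this level of the list, or NULL. */
struct GML_pair* find_pair (struct GML_pair* list, const char* key) {
    for (; list != NULL; list = list->next) {
        if (strcmp (list->key, key) == 0) return list;
    }
    return NULL;
}

/* Counts the pairs on this level that are named key and hold a list. */
int count_lists (struct GML_pair* list, const char* key) {
    int n = 0;
    for (; list != NULL; list = list->next) {
        if (list->kind == GML_LIST && strcmp (list->key, key) == 0) n++;
    }
    return n;
}

/* Number of "node" lists inside the first top-level "graph" list,
   or -1 if there is no such list. */
int count_nodes (struct GML_pair* top) {
    struct GML_pair* graph = find_pair (top, "graph");
    if (graph == NULL || graph->kind != GML_LIST) return -1;
    return count_lists (graph->value.list, "node");
}
```

With the list returned by GML_parser, count_nodes (list) would give the number of nodes in the file, provided parsing succeeded and the file follows the layout assumed above.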

Note: string contains no characters with an ASCII code greater than 127, because these are converted into the ISO 8859-1 format. See the GML Manual for details.

The following auxiliary procedures are defined for GML_pair:

void GML_free_list (struct GML_pair* list, struct GML_list_elem* key_list);

Recursively frees all storage allocated for list and for key_list (which is described above).

void GML_print_list (struct GML_pair* list, int level);

Writes list to stdout, using level for indentation. This is meant for debugging only.


Error Handling

The current input line and column are stored in the global variables GML_line and GML_column. If you are interested in where an error occurred, you should call

void GML_init ();

before calling the parser or scanner for the first time. It sets both variables to 1.
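
How the library updates these counters internally is not described here. As a minimal sketch of the general technique, a scanner can keep a 1-based position and update it for every character it consumes. The struct position and the three functions below are hypothetical and only stand in for GML_line, GML_column and GML_init.

```c
/* Hypothetical 1-based input position, standing in for the global
   variables GML_line and GML_column. */
struct position {
    int line;
    int column;
};

/* Counterpart of GML_init in this sketch: both counters start at 1. */
void position_init (struct position* pos) {
    pos->line = 1;
    pos->column = 1;
}

/* Updates pos after the character c has been consumed. */
void position_advance (struct position* pos, int c) {
    if (c == '\n') {
        pos->line++;        /* a newline starts the next line ... */
        pos->column = 1;    /* ... at its first column */
    } else {
        pos->column++;
    }
}

/* Consumes a whole string; afterwards pos is the position of the next
   character that would be read. */
void position_scan (struct position* pos, const char* text) {
    while (*text != '\0') {
        position_advance (pos, *text++);
    }
}
```

Without the initialization step the counters would not start at 1, so the positions reported in an error would be off.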

The procedures GML_scanner and GML_parser read until they find an error or reach the end of the file. If the parser encounters an error, it returns the GML structure parsed so far and provides error information in stat->err. The scanner reports an error in the value.err field of a GML_ERROR token. The structure GML_error reports scanner and parser errors:

struct GML_error {
    GML_error_value err_num;
    int line;
    int column;
};

line is the input line in which the error occurred, and column is the corresponding column. Both line and column start at 1. err_num is of type GML_error_value, which is listed in Table 3.

Table 3. Enumeration GML_error_value

Value                   Meaning
GML_UNEXPECTED          An unexpected character was found
GML_SYNTAX              Broken key-value structure
GML_PREMATURE_EOF       End of file encountered while reading a string
GML_TOO_MANY_DIGITS     A number has too many digits (that is, more than
                        1024 in the current implementation)
GML_OPEN_BRACKET        At least one bracket not closed at EOF
GML_TOO_MANY_BRACKETS   A closing bracket (]) was found at the top level
GML_OK                  No error occurred

Examples

The following example reads a GML file (the filename is given on the command line) and writes the parsed key-value pairs and the list of keys to standard output.

#include "gml_parser.h"
#include <stdio.h>
#include <stdlib.h>

void print_keys (struct GML_list_elem* list) {
    
    while (list) {
        printf ("%s\n", list->key);
        list = list->next;
    }
}

int main (int argc, char* argv[]) {

    struct GML_pair* list;
    struct GML_stat* stat;
    FILE* file;

    if (argc != 2) {
        printf ("Usage: gml_demo <gml_file>\n");
        return EXIT_FAILURE;
    }

    file = fopen (argv[1], "r");
    if (file == NULL) {
        printf ("No such file: %s\n", argv[1]);
        return EXIT_FAILURE;
    }

    stat = (struct GML_stat*) malloc (sizeof (struct GML_stat));
    if (stat == NULL) {
        fclose (file);
        return EXIT_FAILURE;
    }
    stat->key_list = NULL;

    /* Reset the line and column counters before the first parser call. */
    GML_init ();
    list = GML_parser (file, stat, 0);

    if (stat->err.err_num != GML_OK) {
        printf ("An error occurred while reading line %d column %d of %s:\n",
                stat->err.line, stat->err.column, argv[1]);

        switch (stat->err.err_num) {
        case GML_UNEXPECTED:
            printf ("UNEXPECTED CHARACTER");
            break;

        case GML_SYNTAX:
            printf ("SYNTAX ERROR");
            break;

        case GML_PREMATURE_EOF:
            printf ("PREMATURE EOF IN STRING");
            break;

        case GML_TOO_MANY_DIGITS:
            printf ("NUMBER WITH TOO MANY DIGITS");
            break;

        case GML_OPEN_BRACKET:
            printf ("OPEN BRACKETS LEFT AT EOF");
            break;

        case GML_TOO_MANY_BRACKETS:
            printf ("TOO MANY CLOSING BRACKETS");
            break;

        default:
            break;
        }

        printf ("\n");
    }

    /* Even after an error, list holds the pairs parsed so far. */
    GML_print_list (list, 0);
    printf ("Keys used in %s: \n", argv[1]);
    print_keys (stat->key_list);

    /* Release everything: the caller owns the list, stat and the file. */
    GML_free_list (list, stat->key_list);
    free (stat);
    fclose (file);
    return EXIT_SUCCESS;
}
