Tech bits and Other Musings
Tuesday, April 13, 2010
 
Parsing VCards using regular expressions in C#

The regular expression class (Regex) in C# is very useful for handling complicated string parsing and for extracting the pieces of the parsed string that you're interested in. However, using Regex in practice can be a little bit of a pain because complex regular expressions are hard to declare and debug.

I've been experimenting with useful ways to declare these expressions so that the declarations are less error-prone, more readable, and also so that it's easier to take advantage of the Group and Capture features of Regex.

I'll illustrate this by taking a look at a simplified example of parsing the fields of a vCard file (these vCard files can be created, for example, by exporting a contact from an email application like Microsoft Outlook).

Notes:

A single content line in a vCard file is defined by the following (simplified) ABNF:
contentline = name *(";" param ) ":" value CRLF 
name = iana-token / x-name 
iana-token = 1*(ALPHA / DIGIT / "-") ; registered with IANA 
x-name = "X-" 1*(ALPHA / DIGIT / "-") ; Reserved for non-standard use 
param = param-name "=" param-value *("," param-value) 
param-name = iana-token / x-name 
param-value = ptext / quoted-string 
ptext = *SAFE-CHAR 
value = *VALUE-CHAR 
quoted-string = DQUOTE *QSAFE-CHAR DQUOTE 

NON-ASCII = %x80-FF 
QSAFE-CHAR = WSP / %x21 / %x23-7E / NON-ASCII 
SAFE-CHAR = WSP / %x21 / %x23-2B / %x2D-39 / %x3C-7E / NON-ASCII 
VALUE-CHAR = WSP / VCHAR / NON-ASCII 
WSP = %x20 / %x09 
VCHAR = %x21-7E 
CRLF = %x0D %x0A 
DQUOTE = %x22 
ALPHA = %x41-5A / %x61-7A 
DIGIT = %x30-39
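
As a small sanity check on how one of these rules maps onto a regular expression, here is a sketch that translates SAFE-CHAR into a C# character class and confirms that it rejects the characters the grammar uses as delimiters. The class name AbnfCheck and the helper IsSafeChar are made up for illustration, and WSP is written as an explicit space and tab.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AbnfCheck
{
    // SAFE-CHAR = WSP / %x21 / %x23-2B / %x2D-39 / %x3C-7E / NON-ASCII
    // WSP is spelled out as space (\x20) and tab (\x09).
    private static readonly Regex SafeChar =
        new Regex(@"^[\x20\x09\x21\x23-\x2B\x2D-\x39\x3C-\x7E\x80-\xFF]$");

    // Hypothetical helper: true when the single character c is a SAFE-CHAR.
    public static bool IsSafeChar(char c)
    {
        return SafeChar.IsMatch(c.ToString());
    }

    public static void Main()
    {
        // The double quote, comma, colon, and semicolon are left out of
        // SAFE-CHAR because the grammar uses them as delimiters.
        foreach (char c in "a+ \",:;")
            Console.WriteLine("'{0}' safe: {1}", c, IsSafeChar(c));
    }
}
```

The gaps between the ranges (%x22, %x2C, %x3A, and %x3B) are exactly what lets an unquoted parameter value end at the next comma, semicolon, or colon.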

Now, it takes some studying of these definitions to understand everything, but once you've got it, it's pretty straightforward. To create a Regex object from this you first need to declare the regular expression itself as a string, and this is what I find to be very unwieldy. To get some idea of just how unwieldy it is, here's what the final vCard regular expression looks like:

(([A-Za-z0-9-]+|X-[A-Za-z0-9-]+)(;([A-Za-z0-9-]+|X-[A-Za-z0-9-]+)=([\s\x21\x23-\x2B\x2D-\x39\x3C-\x7E\x80-\xFF]*|\x22[\s\x21\x23-\x7E\x80-\xFF]*\x22)(,([\s\x21\x23-\x2B\x2D-\x39\x3C-\x7E\x80-\xFF]*|\x22[\s\x21\x23-\x7E\x80-\xFF]*\x22))*)*:([\s\x21-\x7E\x80-\xFF]*))

Yeah. It's a doozy, right? I mentioned a better way to declare these, and I'll get to that in a minute. For now let's look at some sample vCard fields:

TEL;TYPE=work;TYPE=voice;type=pref;TYPE=msg:+1-213-555-1234 
TEL;TYPE=work,voice;type=pref;TYPE=msg:+1-213-555-1234 

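These are two equivalent ways of defining the same vCard field. When a field is parsed, there are pieces of the result that you want to extract: the name (TEL), the parameter name (TYPE), the parameter values (work, voice, pref, msg), and the value (+1-213-555-1234).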

It turns out that extracting these pieces from the vCard field is easy to do - using match groups - if you define the regular expression properly, which I've done in the example above. Using this regular expression, the name will be contained in group 2, the parameter name in group 4, the parameter values in groups 5 and 7, and the value in group 8. Here's the code:

string pattern = @"(([A-Za-z0-9-]+|X-[A-Za-z0-9-]+)(;([A-Za-z0-9-]+|X-[A-Za-z0-9-]+)=([\s\x21\x23-\x2B\x2D-\x39\x3C-\x7E\x80-\xFF]*|\x22[\s\x21\x23-\x7E\x80-\xFF]*\x22)(,([\s\x21\x23-\x2B\x2D-\x39\x3C-\x7E\x80-\xFF]*|\x22[\s\x21\x23-\x7E\x80-\xFF]*\x22))*)*:([\s\x21-\x7E\x80-\xFF]*))";
Regex  regex = new Regex(pattern);
Match  match = regex.Match(@"TEL;TYPE=work,voice;type=pref;TYPE=msg:+1-213-555-1234");
if (match.Success && match.Index == 0)
{
    Console.WriteLine(String.Format("name        ={0}", match.Groups[2].Value));
    Console.WriteLine(String.Format("param name  ={0}", match.Groups[4].Value));
    Console.Write(                  "param values=");
    foreach (Capture c in match.Groups[5].Captures)
        Console.Write(String.Format("{0} ", c.Value));

    foreach (Capture c in match.Groups[7].Captures)
        Console.Write(String.Format("{0} ", c.Value));

    Console.WriteLine();
    Console.WriteLine(String.Format("value       ={0}", match.Groups[8].Value));
}

This code would print out the following:
name        =TEL
param name  =TYPE
param values=work pref msg voice 
value       =+1-213-555-1234
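
Notice that only one parameter name is printed even though the line has three parameters. When a group sits inside a repeated construct, Group.Value returns only the last capture, while Group.Captures keeps every one of them. Here's a minimal sketch of the difference, using a deliberately tiny pattern instead of the full vCard expression (the class name CaptureDemo and the helper AllCaptures are just for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Group 1 is inside a repeated non-capturing group, so it can capture many times.
    private const string Pattern = @"^(?:;([a-z]+))*$";

    // Returns every capture of group 1, in the order they were matched.
    public static string[] AllCaptures(string input)
    {
        Match match = Regex.Match(input, Pattern);
        string[] result = new string[match.Groups[1].Captures.Count];
        for (int i = 0; i < result.Length; i++)
            result[i] = match.Groups[1].Captures[i].Value;
        return result;
    }

    public static void Main()
    {
        Match match = Regex.Match(";work;pref;msg", Pattern);
        Console.WriteLine(match.Groups[1].Value);                            // msg
        Console.WriteLine(string.Join(",", AllCaptures(";work;pref;msg"))); // work,pref,msg
    }
}
```

This is why the code above loops over Captures for the parameter values instead of reading Value.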

So, this is pretty useful stuff. Regular expressions make it easy to parse the complex vCard fields and extract the pieces of interest. BUT... declaring regular expressions is a pain! I've settled on the following way of doing this and have found it to be great for defining readable expressions, and for defining and understanding the groups that you need (groups are defined using parentheses). Here's how I'd define the regular expression that I showed you above, wrapped in a C# class:

using System.Text.RegularExpressions;

public static class VCardParser
{
    public static bool Parse(string field, 
        ref string name, ref string paramName, ref string[] paramValues, ref string value)
    {
        bool  result = false;
        Match match  = m_fieldRegex.Match(field);
   
        if (match.Success && match.Index == 0)
        {
            name      = match.Groups[2].Value;
            value     = match.Groups[8].Value;
            paramName = match.Groups[4].Value;

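            // Size the array by the number of captures; Group.Length is only the
            // character length of the last captured substring.
            paramValues = new string[match.Groups[5].Captures.Count + match.Groups[7].Captures.Count];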
    
            int n = 0;
            foreach (Capture c in match.Groups[5].Captures)
                paramValues[n++] = c.Value;

            foreach (Capture c in match.Groups[7].Captures)
                paramValues[n++] = c.Value;
     
            result = true;
        }
   
        return(result);
    }

    private static string DQUOTE       = @"\x22";
    private static string IANA_TOKEN   = @"[A-Za-z0-9-]+";
    private static string X_NAME       = @"X-[A-Za-z0-9-]+";
    private static string SAFE_CHAR    = @"[\s\x21\x23-\x2B\x2D-\x39\x3C-\x7E\x80-\xFF]";
    private static string QSAFE_CHAR   = @"[\s\x21\x23-\x7E\x80-\xFF]";
    private static string VALUE_CHAR   = @"[\s\x21-\x7E\x80-\xFF]";

    private static string NAME         = IANA_TOKEN + "|" + X_NAME;
    private static string PARAM_NAME   = NAME;
    private static string PARAM_VALUE  = SAFE_CHAR + "*|" + DQUOTE + QSAFE_CHAR + "*" + DQUOTE;
    private static string CONTENT_LINE = 
        "(" +                                     // group 1
            "(" + NAME + ")" +                    // group 2
            "(;" +                                // group 3
                "(" + PARAM_NAME  + ")=" +        // group 4
                "(" + PARAM_VALUE + ")"  +        // group 5
                "(," +                            // group 6
                    "(" + PARAM_VALUE + ")" +     // group 7
                ")*" +
            ")*" +
            ":(" + VALUE_CHAR + "*)" +            // group 8
        ")";
  
    private static Regex m_fieldRegex = new Regex(CONTENT_LINE);
}

This is admittedly a little verbose, but what's gained in readability and ease of defining and seeing the group definitions feels easily worth it to me. Here's an example of how you would use this class:

string   name        = null;
string   value       = null;
string   paramName   = null;
string[] paramValues = null;
if (VCardParser.Parse(@"TEL;TYPE=work,voice;type=pref;TYPE=msg:+1-213-555-1234",
        ref name, ref paramName, ref paramValues, ref value))
{
    Console.WriteLine(String.Format("name        ={0}", name));
    Console.WriteLine(String.Format("param name  ={0}", paramName));
    Console.Write(                  "param values=");
    foreach (string s in paramValues)
        Console.Write(String.Format("{0} ", s));
    
    Console.WriteLine();
    Console.WriteLine(String.Format("value       ={0}", value));
}
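
The group numbers in the comments of the VCardParser class have to be kept in sync with the parentheses by hand. One variation, which is not what that class does, is to use .NET named groups so that the code asks for a group by name instead of by position. The sketch below is an assumption-laden rewrite: the class name NamedGroupSketch and the group names (name, pname, pvalue, value) are made up, x-name is folded into the token pattern since the same character class already covers it, and WSP is written as an explicit space and tab.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupSketch
{
    private const string Token      = @"[A-Za-z0-9-]+";
    private const string SafeChar   = @"[\x20\x09\x21\x23-\x2B\x2D-\x39\x3C-\x7E\x80-\xFF]";
    private const string QSafeChar  = @"[\x20\x09\x21\x23-\x7E\x80-\xFF]";
    private const string ValueChar  = @"[\x20\x09\x21-\x7E\x80-\xFF]";
    private const string ParamValue = SafeChar + @"*|\x22" + QSafeChar + @"*\x22";

    // (?<x>...) is a named group and (?:...) groups without capturing.
    // Both pvalue groups share one name, so .NET collects their captures together.
    private static readonly Regex Field = new Regex(
        "^(?<name>" + Token + ")" +
        "(?:;(?<pname>" + Token + ")=" +
            "(?<pvalue>" + ParamValue + ")" +
            "(?:,(?<pvalue>" + ParamValue + "))*" +
        ")*" +
        ":(?<value>" + ValueChar + "*)");

    public static Match Parse(string field)
    {
        return Field.Match(field);
    }

    public static void Main()
    {
        Match m = Parse("TEL;TYPE=work,voice;type=pref:+1-213-555-1234");
        Console.WriteLine("name ={0}", m.Groups["name"].Value);
        Console.WriteLine("value={0}", m.Groups["value"].Value);
        foreach (Capture c in m.Groups["pvalue"].Captures)
            Console.WriteLine("param value={0}", c.Value);
    }
}
```

The trade-off is a slightly noisier pattern string in exchange for not having to recount groups whenever a pair of parentheses is added or removed.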

One last thing: some vCard fields can have parameters with different names. For example, the PHOTO field can have both an ENCODING and a TYPE parameter:

PHOTO;ENCODING=b;TYPE=JPEG:MIICajCCAdOgAwIBAgICBEUwDQYJKoZIhvcN
AQEEBQAwdzELMAkGA1UEBhMCVVMxLDAqBgNVBAoTI05ldHNjYXBlIENvbW11bm
ljYXRpb25zIENvcnBvcmF0aW9uMRwwGgYDVQQLExNJbmZvcm1hdGlvbiBTeXN0
(...remainder of "B" encoded binary data...)

The regular expression above would parse this properly and let you pick out the parameter names and values. However, the code shown so far would NOT let you figure out that the parameter value "b" was associated with the parameter name ENCODING, or that the parameter value JPEG was associated with the parameter name TYPE, because each group's captures are collected into one flat list. I may write another post showing how this can be done.
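
In the meantime, here is one possible approach, offered as a sketch and not as a finished solution. Every Capture records its Index and Length, so the captures of group 3 (one whole ";name=value,value" parameter) can be used as windows: any capture of groups 4, 5, or 7 that falls inside a window belongs to that parameter. The class name ParamPairing and the helpers Pair and Contained are hypothetical; the pattern keeps the same group numbering as the expression above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParamPairing
{
    private const string Name       = @"[A-Za-z0-9-]+|X-[A-Za-z0-9-]+";
    private const string SafeChar   = @"[\s\x21\x23-\x2B\x2D-\x39\x3C-\x7E\x80-\xFF]";
    private const string QSafeChar  = @"[\s\x21\x23-\x7E\x80-\xFF]";
    private const string ValueChar  = @"[\s\x21-\x7E\x80-\xFF]";
    private const string ParamValue = SafeChar + @"*|\x22" + QSafeChar + @"*\x22";

    private static readonly Regex Field = new Regex(
        "((" + Name + ")" +                  // groups 1 and 2
        "(;(" + Name + ")=" +                // groups 3 and 4
            "(" + ParamValue + ")" +         // group 5
            "(,(" + ParamValue + "))*" +     // groups 6 and 7
        ")*" +
        ":(" + ValueChar + "*))");           // group 8

    // Maps each parameter name (case-insensitively) to the values found with it.
    public static Dictionary<string, List<string>> Pair(string field)
    {
        var result = new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
        Match m = Field.Match(field);
        if (!m.Success || m.Index != 0)
            return result;

        foreach (Capture param in m.Groups[3].Captures)
        {
            string key = Contained(m.Groups[4].Captures, param)[0];
            if (!result.ContainsKey(key))
                result[key] = new List<string>();
            result[key].AddRange(Contained(m.Groups[5].Captures, param));
            result[key].AddRange(Contained(m.Groups[7].Captures, param));
        }
        return result;
    }

    // Returns the captures that lie strictly after the start of 'outer' and
    // end no later than 'outer' does. The strict lower bound keeps an empty
    // value at the end of one parameter from also counting for the next one.
    private static List<string> Contained(CaptureCollection inner, Capture outer)
    {
        var found = new List<string>();
        foreach (Capture c in inner)
            if (c.Index > outer.Index && c.Index + c.Length <= outer.Index + outer.Length)
                found.Add(c.Value);
        return found;
    }

    public static void Main()
    {
        foreach (var pair in Pair("PHOTO;ENCODING=b;TYPE=JPEG:MIICajCCAdOg"))
            Console.WriteLine("{0} -> {1}", pair.Key, string.Join(",", pair.Value.ToArray()));
    }
}
```

For the TEL example from earlier, this would group work, voice, pref, and msg under a single TYPE entry, since the dictionary compares names without regard to case.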

Have any thoughts to share?

Cheers!
