Thread: Hexadecimal Characters in RegEx

  1. #1
    Registered User
    Join Date
    Apr 2008
    Posts
    12

    Hexadecimal Characters in RegEx

    Hi there,

    I'm trying to match a portion of a network packet (captured with libpcap) using RegEx. In this case, I want to identify if the packet is a DNS packet. Here is an example trace that I would like to match:

    Code:
    15:29:26.792551 IP 70.85.31.142.32868 > 67.18.92.7.53: 11852+ A? www.cnn.com. (29)
            0x0000:  4500 0039 0000 4000 4011 35b8 4655 1f8e
            0x0010:  4312 5c07 8064 0035 0025 e930 2e4c 0100
            0x0020:  0001 0000 0000 0000 0377 7777 0363 6e6e
            0x0030:  0363 6f6d 0000 0100 01
    The www.cnn.com is pretty clear (0377 7777 0363 6e6e 0363 6f6d). And I believe the third byte from the end (0x01, two bytes after the end of the domain name) indicates that this is an A record request.
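    For reference, the bytes after the 12-byte DNS header are length-prefixed labels: 03 "www", 03 "cnn", 03 "com", then a zero byte that ends the name, followed by a two-byte query type and a two-byte query class. A minimal sketch of walking that layout, assuming an uncompressed name (decode_qname is a hypothetical helper, not part of the code in this thread):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: decode an uncompressed DNS name that starts at q[0]
 * into dotted text in out. Returns the number of bytes consumed, including
 * the terminating zero-length label, or 0 on malformed or oversized input. */
size_t decode_qname(const unsigned char *q, size_t len, char *out, size_t outsz)
{
    size_t i = 0, o = 0;

    assert(q != NULL && out != NULL && outsz > 0);
    while (i < len) {
        size_t n = q[i++];                 /* label length byte */
        if (n == 0) {                      /* root label: end of name */
            out[o] = '\0';
            return i;
        }
        /* Reject compression pointers, truncated labels, and output overflow
         * (room is needed for a dot, the label, and the final NUL). */
        if (n > 63 || n > len - i || o + n + 2 > outsz)
            return 0;
        if (o > 0)
            out[o++] = '.';
        memcpy(out + o, q + i, n);
        o += n;
        i += n;
    }
    return 0;                              /* ran out of data before the root label */
}
```

    Fed the question section from the trace above, this produces "www.cnn.com" and consumes 13 bytes; the next two bytes (00 01) are the query type, where 1 means A.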

    A regex of:
    Code:
    \\x63\\x6f\\x6d\\x00\\x00\\x01
    should match this packet, but I can't seem to get it to work. Any ideas? Here is the code (note that the other matches work fine, but they are just matching ASCII characters):

    Code:
    #include <sys/types.h>
    #include <regex.h>
    #include "../common/common.h"
    
    
    _Bool is_http(const u_char * payload);
    _Bool is_ldap(const u_char * payload);
    _Bool is_syslog(const u_char * payload);
    _Bool is_dns(const u_char * payload);
    _Bool evaluate(const u_char * payload, regex_t * pattern);
    
    regex_t http_compiled, ldap_compiled, syslog_compiled, dns_compiled;
    _Bool compiled = 0;
    
    void compile() {
    	regcomp (&http_compiled, "(GET / HTTP)|(HTTP/1\\.1)", REG_EXTENDED);
    	regcomp (&ldap_compiled, "(cn=\\w*,?)|(dc=\\w*,?)|(ou=\\w*,?)", REG_EXTENDED);
    	regcomp (&syslog_compiled, "<[0-9][0-9]?>\\w|/", REG_EXTENDED);
    	regcomp (&dns_compiled, "\\x63\\x6f\\x6d\\x00\\x00\\x01", REG_EXTENDED); //matches 'A' record for .com
    	compiled = 1;
    }
    
    _Bool is_http(const u_char * payload) {
    	if(!compiled) compile();
    	return(evaluate(payload, &http_compiled));
    }
    
    _Bool is_ldap(const u_char * payload) {
    	if(!compiled) compile();
    	return(evaluate(payload, &ldap_compiled));
    }
    
    _Bool is_syslog(const u_char * payload) {
    	if(!compiled) compile();
    	return(evaluate(payload, &syslog_compiled));
    }
    
    _Bool is_dns(const u_char * payload) {
    	if(!compiled) compile();
    	return(evaluate(payload, &dns_compiled));
    }
    
    _Bool evaluate(const u_char * payload, regex_t * pattern) {
    	if((regexec (pattern, payload, 0, NULL, 0)) == 0) {
    		return 1;
    	} else return 0;
    }
    Thanks in advance for your help!
    Last edited by maestro371; 04-11-2008 at 04:51 PM. Reason: mis-typed code

  2. #2
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    This was not trivial, but I got it working. Only took me about 3 hours.

    The exercise of implementing for your situation is left to you.

    The keys to getting it working were specifying REG_STARTEND, REG_PEND and pmatch, and all their associated required settings. And, of course, correcting the pattern.
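    The original code is no longer attached to this post. As a minimal sketch of the REG_STARTEND part only, and assuming your regex.h provides that flag (it is a BSD extension that glibc also defines; it is not in POSIX), the call could look roughly like this. The flag tells regexec() to take the search bounds from pmatch[0] instead of stopping at the first NUL in the data. match_bytes is a made-up name, and this does not address NULs inside the pattern itself, which is what REG_PEND is for:

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the compiled regex matches anywhere in
 * buf[0..len), even when buf contains NUL bytes; 0 otherwise.
 * Assumes REG_STARTEND is available (BSD, macOS, glibc). */
int match_bytes(const regex_t *re, const unsigned char *buf, size_t len)
{
    regmatch_t pm[1];

    assert(re != NULL && buf != NULL);
    pm[0].rm_so = 0;                /* start offset of the region to search */
    pm[0].rm_eo = (regoff_t)len;    /* end offset (one past the last byte) */
    return regexec(re, (const char *)buf, 1, pm, REG_STARTEND) == 0;
}
```

    The regex must be compiled without REG_NOSUB for pmatch to be usable this way.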

    You're welcome!

    Todd
    Last edited by Dino; 04-12-2008 at 03:46 AM. Reason: glad you got the code, now I'm removing it.
    Mainframe assembler programmer by trade. C coder when I can.

  3. #3
    Registered User
    Join Date
    Apr 2008
    Posts
    12

    Fantastic

    There are a number of elements here that are new to me; I'll have to spend some time digging through the regex structures and functions a bit more to fully grasp it.

    Thanks for taking the time to work through this!

  4. #4
    Registered User
    Join Date
    Apr 2008
    Posts
    12
    Quote: Originally posted by Todd Burch
    This was not trivial, but I got it working. Only took me about 3 hours.

    The exercise of implementing for your situation is left to you.

    The keys to getting it working were specifying REG_STARTEND, REG_PEND and pmatch, and all their associated required settings. And, of course, correcting the pattern.

    You're welcome!

    Todd
    It looks like I might have to switch my development to a BSD variant. I've been working on Ubuntu (Gutsy) Linux and Debian (Etch). The REG_PEND, re_endp, and REG_ITOA elements all seem to be implementation choices that haven't been integrated into Linux yet (from what I can tell).

    Those portions of the code compile and execute perfectly on my Mac OS X client but not on the target build system. I'll dig a bit more to find working alternatives for Linux, but might end up just loading FreeBSD on my target system to see if that's a better fit. Most of my other code should be pretty portable.

    I'll let you know what I learn!

  5. #5
    Registered User
    Join Date
    Apr 2008
    Posts
    12

    Slight Alterations

    I started just tweaking your code a bit to remove the parts that made Linux stumble and appear to have made it work, although I'm still investigating the ramifications. Here's what I changed:

    1. I defined REG_ITOA as 0400 (#define REG_ITOA 0400) because I found a reference to that value in a Google search.
    2. I changed "REG_PEND" to "0".
    3. I pointed "myregex->re_endp" to "myregex->buffer".

    I think that last one might be a really bad idea, and I'm reading about what those structure elements all do. From an initial glance, it looked like "buffer" and "re_endp" might serve similar purposes, but I could be completely wrong.
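    One lower-risk alternative to defining the missing macros by hand is to test for REG_PEND at compile time and only touch re_endp where the header actually declares it. The sketch below assumes that approach; regcomp_bytes is a hypothetical wrapper, and on systems without REG_PEND it simply refuses patterns that contain a NUL rather than pretending to support them:

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: compile a pattern of len bytes stored in pat.
 * pat must also be NUL-terminated at pat[len] so the fallback path can
 * hand it to plain regcomp(). Returns 0 on success or a regcomp() error. */
int regcomp_bytes(regex_t *re, const char *pat, size_t len, int cflags)
{
    assert(re != NULL && pat != NULL);
#ifdef REG_PEND
    /* BSD/macOS: the pattern ends at re_endp, so it may contain NULs. */
    re->re_endp = pat + len;
    return regcomp(re, pat, cflags | REG_PEND);
#else
    /* No REG_PEND (e.g. glibc): a NUL would silently truncate the pattern. */
    if (memchr(pat, 0, len) != NULL)
        return REG_BADPAT;
    return regcomp(re, pat, cflags);
#endif
}
```

    This keeps one source file building on both Mac OS X and Linux, at the cost of the Linux build not handling NULs in the pattern at all.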

    I'll keep working with it.

  6. #6
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    I too have a Mac with OS X and it worked just like I expected it. I changed the regex pattern, and it failed. I changed the data pattern, and it failed. When both were as intended, it worked.

    The nulls in both the data and the regex pattern require the use of the additional option flags and address settings.
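    To make the problem concrete: regcomp() and regexec() take ordinary C strings, so without the extra flags they stop at the first 0x00 byte. The sketch below (bytes_before_nul is a made-up helper) reports how much of a buffer such a string-based call would actually look at:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of leading bytes of buf[0..len) that a
 * NUL-terminated string API would see before hitting the first 0x00. */
size_t bytes_before_nul(const unsigned char *buf, size_t len)
{
    const unsigned char *nul;

    assert(buf != NULL);
    nul = memchr(buf, 0, len);
    return nul != NULL ? (size_t)(nul - buf) : len;
}
```

    For the DNS payload in the first post, which begins 2e 4c 01 00, this returns 3, so only three of the 29 payload bytes would be examined.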

    Good luck!
    Mainframe assembler programmer by trade. C coder when I can.

  7. #7
    Registered User
    Join Date
    Apr 2008
    Posts
    12
    Quote: Originally posted by Todd Burch
    I too have a Mac with OS X and it worked just like I expected it. I changed the regex pattern, and it failed. I changed the data pattern, and it failed. When both were as intended, it worked.

    The nulls in both the data and the regex pattern require the use of the additional option flags and address settings.

    Good luck!
    Okay, I think I have a fairly sustainable solution without using the REG_PEND. After you tipped me off on the null issue, I decided to try altering the null characters in a predictable way before they were passed to the REGEX engine. Here's what I've got:

    Code:
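    /* Trailing '\0' added so regcomp() receives a properly terminated string. */
    char pattern[] = { '\\', 'w', '\\', 'w', 0x0a, 0x0a, 0x01, 0x0a, 0x01, '\0' };

    u_int8_t data[size + 1];	/* room for the terminator regexec() expects */
    printf("Evaluating: ");
    int i;
    for( i = 0; i < size; i++ ) {
    	if(payload[i] == 0x00) data[i] = 0x0A;	/* stand-in for NUL */
    	else data[i] = payload[i];
    	printf("0x%x ", data[i]);
    }
    data[size] = '\0';
    printf("\n");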
    I of course altered the pattern to match (and I also switched from looking for "com" to any two word characters prior to the first expected null to catch "org" or "uk" or "cn", etc.). I put this in a new character array because I'm passing the original as a const (the payload pointer points to the data pulled directly off the wire). The printf is just so that I can see the transformation.
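    The same idea can be packaged as a helper that also NUL-terminates the copy, which regexec() requires. This is only a sketch: the function names are hypothetical, the pattern uses the POSIX class [[:alnum:]_] in place of \w (which POSIX does not define), and the substitution byte 0x0a is kept from the code above. One caveat is that a payload byte that really is 0x0a becomes indistinguishable from a replaced NUL, so false positives are possible:

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: copy size bytes from payload into out, replacing every
 * 0x00 with 0x0a, and NUL-terminate the result. out must hold size + 1 bytes. */
void copy_replacing_nuls(const unsigned char *payload, size_t size, char *out)
{
    size_t i;

    assert(payload != NULL && out != NULL);
    for (i = 0; i < size; i++)
        out[i] = (payload[i] == 0x00) ? 0x0a : (char)payload[i];
    out[size] = '\0';
}

/* Two word characters, then the substituted bytes for 00 00 01 00 01:
 * end of name, query type 1 (A), query class 1 (IN). */
static const char dns_a_pattern[] = "[[:alnum:]_][[:alnum:]_]\n\n\001\n\001";

/* Returns 1 if the payload looks like an A-record question, 0 if not,
 * and -1 if it is too large for the local buffer or the regex fails. */
int looks_like_dns_a_query(const unsigned char *payload, size_t size)
{
    char data[1501];
    regex_t re;
    int rc;

    if (size >= sizeof data)
        return -1;
    if (regcomp(&re, dns_a_pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    copy_replacing_nuls(payload, size, data);
    rc = regexec(&re, data, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

    Compiling the pattern on every call is wasteful; in real code it would be compiled once, as the compile() function in the first post does.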

    The data array is u_int8_t because I found that 'char' causes the 0xff values to be printed as 0xffffffff. I found that really annoying - I guess 'char' in Linux is 32 bits? Seems odd. I thought 'char's were always 1 byte. Here's the output that perplexed me before I assigned the values to an 8-bit int:

    Evaluating: 0xa 0x2 0xa 0xc 0x1 0x4 0xa 0xa 0xa 0x1 0x2 0x4 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0x4e 0x50 0x7f 0x35

    Anyway, that's a rabbit trail. All of this seems to bypass that null problem pretty well. Do you see any possible issues with this approach? I'm quite new to C, so I'm still actively learning the little gotchas.

  8. #8
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    You can switch to unsigned char and that should get you back to 0xff from 0xffffffff. What is happening is that the sign of the 1-byte value is being propagated through the rest of the word when it is converted to int. Any signed char value of 0x80 or greater causes this.
    Mainframe assembler programmer by trade. C coder when I can.

  9. #9
    Registered User
    Join Date
    Apr 2008
    Posts
    12
    Great! Thanks for the tip; if I understand correctly, numbers 0x80 and greater (> 127 decimal) extend beyond the boundary of a short int and cause the compiler to use a standard int (32 bits) instead.

  10. #10
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    No, good guess, but wrong.

    A byte is 8 bits, or an "octet". 8 bits gives 2^8, or 256, possible values: 0-255 when unsigned.

    A short int on modern computers is usually 2 bytes, aka 16 bits (a "halfword" on some platforms), which gives 2^16, or 65,536, possible values: 0-65,535 when unsigned. As you can see, decimal 128 fits quite easily into a short.

    In a byte, when the sign bit is on (the 0x80 bit) and the value is moved into a larger field (a short int, an int, or a long long), the sign bit is extended to the left. To keep the sign bit from being extended, you have to use unsigned values.
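    A small sketch of the effect, assuming a platform where int is 32 bits and signed values use two's complement (widen_signed and widen_unsigned are illustrative names). Whether plain char is signed is implementation-defined, so the sketch uses signed char and unsigned char explicitly:

```c
#include <assert.h>
#include <limits.h>

/* Stores the byte in a signed char, then widens it to int. If the 0x80 bit
 * is set, the sign bit is copied into all the upper bits of the int. */
unsigned int widen_signed(unsigned char byte)
{
    signed char sc = (signed char)byte;

    /* The 0xffffffff-style results assume a 32-bit int. */
    assert(sizeof(int) * CHAR_BIT == 32);
    return (unsigned int)(int)sc;
}

/* Same, but through an unsigned char: the upper bits are filled with zeros. */
unsigned int widen_unsigned(unsigned char byte)
{
    unsigned char uc = byte;

    return (unsigned int)uc;
}
```

    Printing these results with "0x%x" gives 0xffffffff for widen_signed(0xff) and 0xff for widen_unsigned(0xff), while a value such as 0x4e prints as 0x4e either way.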
    Mainframe assembler programmer by trade. C coder when I can.

  11. #11
    Registered User
    Join Date
    Apr 2008
    Posts
    12
    Quote: Originally posted by Todd Burch
    No, good guess, but wrong.

    A byte is 8 bits, or an "octet". 8 bits gives 2^8, or 256, possible values: 0-255 when unsigned.

    A short int on modern computers is usually 2 bytes, aka 16 bits (a "halfword" on some platforms), which gives 2^16, or 65,536, possible values: 0-65,535 when unsigned. As you can see, decimal 128 fits quite easily into a short.

    In a byte, when the sign bit is on (the 0x80 bit) and the value is moved into a larger field (a short int, an int, or a long long), the sign bit is extended to the left. To keep the sign bit from being extended, you have to use unsigned values.
    Got it. Thanks for the correction.
