Hexadecimal Characters in RegEx

**maestro371** · 04-11-2008

Hi there,

I'm trying to match a portion of a network packet (captured with libpcap) using RegEx. In this case, I want to identify if the packet is a DNS packet. Here is an example trace that I would like to match:

Code:

15:29:26.792551 IP 70.85.31.142.32868 > 67.18.92.7.53: 11852+ A? www.cnn.com. (29)
        0x0000:  4500 0039 0000 4000 4011 35b8 4655 1f8e
        0x0010:  4312 5c07 8064 0035 0025 e930 2e4c 0100
        0x0020:  0001 0000 0000 0000 0377 7777 0363 6e6e
        0x0030:  0363 6f6d 0000 0100 01

The www.cnn.com is pretty clear (0377 7777 0363 6e6e 0363 6f6d). And I believe the third byte from the end (0x01 - two bytes after the end of the domain name) indicates this is an A record request.
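To make the byte layout described above concrete, here is a minimal sketch that walks the question section of a DNS message by hand. The function name `dns_first_question` and its parameters are hypothetical, not part of libpcap or any DNS library. It assumes `dns` points at the DNS header (that is, after the 20-byte IP header and 8-byte UDP header in the trace) and it does not handle name compression pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the first question of a DNS message.
 * Writes the dotted name into `name` (capacity `cap`) and the QTYPE
 * into `*qtype`. Returns 0 on success, -1 on truncated or unsupported
 * input. Compression pointers are NOT handled in this sketch. */
int dns_first_question(const uint8_t *dns, size_t len,
                       char *name, size_t cap, uint16_t *qtype)
{
    assert(dns != NULL && name != NULL && qtype != NULL);

    size_t pos = 12;   /* the DNS header is a fixed 12 bytes */
    size_t out = 0;

    for (;;) {
        if (pos >= len) return -1;
        uint8_t label_len = dns[pos++];
        if (label_len == 0) break;              /* root label ends the name */
        if (label_len > 63) return -1;          /* pointer or invalid length */
        if (pos + label_len > len) return -1;   /* label runs past the data */
        if (out + label_len + 2 > cap) return -1; /* room for '.' and '\0' */
        if (out > 0) name[out++] = '.';
        for (uint8_t i = 0; i < label_len; i++)
            name[out++] = (char)dns[pos++];
    }
    if (out >= cap) return -1;
    name[out] = '\0';

    if (pos + 4 > len) return -1;               /* QTYPE and QCLASS, 2 bytes each */
    *qtype = (uint16_t)((dns[pos] << 8) | dns[pos + 1]);
    return 0;
}
```

Fed the 29 DNS bytes from the trace, this should yield the name www.cnn.com and a QTYPE of 1 (an A record), which agrees with the reading of the dump above: each label is preceded by its length (03), the name ends with a single 00, and the next two bytes (00 01) are the record type.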

A regex of:

Code:

\\x63\\x6f\\x6d\\x00\\x00\\x01

should match this packet, but I can't seem to get it to work. Any ideas? Here is the code (note that the other matches work fine, but they are just matching ASCII characters):

Code:

#include <sys/types.h>
#include <regex.h>
#include "../common/common.h"


_Bool is_http(const u_char * payload);
_Bool is_ldap(const u_char * payload);
_Bool is_syslog(const u_char * payload);
_Bool is_dns(const u_char * payload);
_Bool evaluate(const u_char * payload, regex_t * pattern);

regex_t http_compiled, ldap_compiled, syslog_compiled, dns_compiled;
_Bool compiled = 0;

void compile() {
	regcomp (&http_compiled, "(GET / HTTP)|(HTTP/1\\.1)", REG_EXTENDED);
	regcomp (&ldap_compiled, "(cn=\\w*,?)|(dc=\\w*,?)|(ou=\\w*,?)", REG_EXTENDED);
	regcomp (&syslog_compiled, "<[0-9][0-9]?>\\w|/", REG_EXTENDED);
	regcomp (&dns_compiled, "\\x63\\x6f\\x6d\\x00\\x00\\x01", REG_EXTENDED); //matches 'A' record for .com
	compiled = 1;
}

_Bool is_http(const u_char * payload) {
	if(!compiled) compile();
	return(evaluate(payload, &http_compiled));
}

_Bool is_ldap(const u_char * payload) {
	if(!compiled) compile();
	return(evaluate(payload, &ldap_compiled));
}

_Bool is_syslog(const u_char * payload) {
	if(!compiled) compile();
	return(evaluate(payload, &syslog_compiled));
}

_Bool is_dns(const u_char * payload) {
	if(!compiled) compile();
	return(evaluate(payload, &dns_compiled));
}

_Bool evaluate(const u_char * payload, regex_t * pattern) {
	if((regexec (pattern, payload, 0, NULL, 0)) == 0) {
		return 1;
	} else return 0;
}

Thanks in advance for your help!

**Dino** · 04-11-2008

This was not trivial, but I got it working. Only took me about 3 hours.

The exercise of implementing for your situation is left to you.

The keys to getting it working were specifying REG_STARTEND, REG_PEND and pmatch, and all their associated required settings. And, of course, correcting the pattern.

You're welcome!

Todd

**maestro371** · 04-12-2008

There are a number of elements here that are new to me; I'll have to spend some time digging through the regex structures and functions a bit more to fully grasp it.

Thanks for taking the time to work through this!

**maestro371** · 04-12-2008

Originally Posted by Todd Burch

This was not trivial, but I got it working. Only took me about 3 hours.

The exercise of implementing for your situation is left to you.

The keys to getting it working were specifying REG_STARTEND, REG_PEND and pmatch, and all their associated required settings. And, of course, correcting the pattern.

You're welcome!

Todd

It looks like I might have to switch my development to a BSD variant. I've been working on Ubuntu (Gutsy) Linux and Debian (Etch). The REG_PEND, re_endp, and REG_ITOA elements all seem to be implementation choices that haven't been integrated into Linux yet (from what I can tell).

Those portions of the code compile and execute perfectly on my Mac OS X client but not on the target build system. I'll dig a bit more to find working alternatives for Linux, but might end up just loading FreeBSD on my target system to see if that's a better fit. Most of my other code should be pretty portable.

I'll let you know what I learn!

**maestro371** · 04-12-2008

I started just tweaking your code a bit to remove the parts that made Linux stumble and appear to have made it work, although I'm still investigating the ramifications. Here's what I changed:

1. I defined REG_ITOA as 0400 (#define REG_ITOA 0400) because I found a reference to that value in a Google search.
2. I changed "REG_PEND" to "0".
3. I pointed "myregex->re_endp" to "myregex->buffer".

I think that last one might be a really bad idea, and I'm reading about what those structure elements all do. From an initial glance, it looked like "buffer" and "re_endp" might serve similar purposes, but I could be completely wrong.

I'll keep working with it.

**Dino** · 04-12-2008

I too have a Mac with OS X and it worked just like I expected it. I changed the regex pattern, and it failed. I changed the data pattern, and it failed. When both were as intended, it worked.

The nulls in both the data and the regex pattern require the use of the additional option flags and address settings.
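A small sketch of why the nulls matter, assuming the standard `regcomp()`/`regexec()` interface, which takes ordinary NUL-terminated C strings. The helper `visible_length` and the two arrays are hypothetical names used only for illustration. The first array holds the bytes the original pattern was meant to express; the second holds what the C literal `"\\x63"` actually hands to the regex engine. POSIX does not define `\x` hex escapes in regular expressions, so that text is not guaranteed to mean the byte 0x63.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The bytes the pattern was meant to contain: 'c' 'o' 'm' 00 00 01,
 * followed by the usual string terminator. */
const char intended_pattern[] = { 0x63, 0x6f, 0x6d, 0x00, 0x00, 0x01, '\0' };

/* What "\\x63" in C source really is: the four characters \ x 6 3. */
const char escaped_pattern[] = "\\x63";

/* Hypothetical helper: the length a NUL-terminated API would see. */
size_t visible_length(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}
```

`visible_length(intended_pattern)` is 3 even though the array holds 7 bytes, so a function that reads a C string never sees anything past "com". The same truncation applies to the packet data passed to `regexec()`, which is why either length-aware extensions or some rewriting of the NUL bytes is needed.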

Good luck!

**maestro371** · 04-12-2008

Originally Posted by Todd Burch

I too have a Mac with OS X and it worked just like I expected it. I changed the regex pattern, and it failed. I changed the data pattern, and it failed. When both were as intended, it worked.

The nulls in both the data and the regex pattern require the use of the additional option flags and address settings.

Good luck!

Okay, I think I have a fairly sustainable solution without using the REG_PEND. After you tipped me off on the null issue, I decided to try altering the null characters in a predictable way before they were passed to the REGEX engine. Here's what I've got:

Code:

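/* "\w\w" followed by the bytes 00 00 01 00 01, with each NUL written as 0x0a. */
/* The trailing '\0' terminates the string for regcomp(). */
char pattern[] = { '\\', 'w', '\\', 'w', 0x0a, 0x0a, 0x01, 0x0a, 0x01, '\0' };

u_int8_t data[size + 1];	/* one extra byte for the terminator regexec() needs */
printf("Evaluating: ");
int i;
for( i = 0; i < size; i++ ) {
	/* Replace each NUL so the regex engine does not stop early. */
	if((*(payload + i)) == 0x00) data[i] = 0x0A;
	else data[i] = *(payload + i);
	printf("0x%x ", data[i]);
}
data[size] = '\0';
printf("\n");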

I of course altered the pattern to match (and I also switched from looking for "com" to any two word characters prior to the first expected null to catch "org" or "uk" or "cn", etc.). I put this in a new character array because I'm passing the original as a const (the payload pointer points to the data pulled directly off the wire). The printf is just so that I can see the transformation.

The data array is u_int8_t because I found that 'char' causes the 0xff values to be printed as 0xffffffff. I found that really annoying - I guess 'char' in linux is 32 bits? Seems odd. I thought 'char's were always 1 byte. Here's the output that perplexed me before assigning that to an 8 bit int:

Evaluating: 0xa 0x2 0xa 0xc 0x1 0x4 0xa 0xa 0xa 0x1 0x2 0x4 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0x4e 0x50 0x7f 0x35

Anyway, that's a rabbit trail. All of this seems to bypass that null problem pretty well. Do you see any possible issues with this approach? I'm quite new to C, so I'm still actively learning the little gotchas.
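For reference, the substitution approach described above could be put together end to end roughly as follows. This is a sketch under stated assumptions, not the code from the posts: the function name `matches_a_query` and the `NUL_STAND_IN` macro are hypothetical, `[[:alnum:]]` is used in place of `\w` (which is an extension rather than POSIX syntax), and the pattern is compiled on every call for brevity. One caveat that seems worth noting: a genuine 0x0A byte in the payload becomes indistinguishable from a substituted NUL, so false positives are possible.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

#define NUL_STAND_IN 0x0A   /* assumption: 0x0A stands in for every NUL byte */

/* Hypothetical helper. Returns 1 if the payload looks like it ends a
 * name and asks for an A record in class IN, 0 if not, -1 on error. */
int matches_a_query(const unsigned char *payload, size_t size)
{
    assert(payload != NULL);

    /* Two alphanumerics (tail of the last label), the root label 00,
     * QTYPE 00 01 and QCLASS 00 01, with each 00 written as 0x0A. */
    static const char pattern[] = "[[:alnum:]][[:alnum:]]\n\n\x01\n\x01";

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    /* Copy the payload, replacing NULs, and terminate it for regexec(). */
    char *copy = malloc(size + 1);
    if (copy == NULL) {
        regfree(&re);
        return -1;
    }
    for (size_t i = 0; i < size; i++)
        copy[i] = (payload[i] == 0x00) ? NUL_STAND_IN : (char)payload[i];
    copy[size] = '\0';

    int rc = regexec(&re, copy, 0, NULL, 0);

    free(copy);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Copying into a heap buffer with one extra byte keeps the original `const` payload untouched and guarantees the terminator that `regexec()` relies on. In real use the `regcomp()` call would more likely be done once, as in the `compile()` function from the first post.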

**Dino** · 04-13-2008

You can switch to unsigned char and that should get you back to 0xff from 0xffffffff. What is happening is that the sign of the 1 byte is being propagated throughout the rest of the word when converting to int. Any char value >= 0x80 causes this.

**maestro371** · 04-13-2008

Great! Thanks for the tip; if I understand correctly, numbers 0x80 and greater (> 127 decimal) extend beyond the boundary of a short int and cause the compiler to use a standard int (32 bits) instead.

**Dino** · 04-13-2008

No, good guess, but wrong.

A byte is represented by 8 bits, or, an "octet". 8 bits is 2^8, or, 256, which is 256 values represented by 0-255.

A short int on modern computers is usually 2 bytes, aka 16 bits, and called a "short" (a "halfword" on some platforms) and is 2^16, or 65,536, which is 65,536 values or 0-65,535. As you can see, decimal 128 fits quite easily into a short.

In a byte, when the sign bit is on (the 0x80 bit), and the value is moved into a larger field (a short int or an int or long long), the sign bit is extended to the left. To keep the sign bit from being extended, you have to use unsigned values.
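The sign extension described above can be seen with a few lines of C. This is a minimal sketch; the function names `widen_signed`, `widen_unsigned`, and `show_byte` are made up for illustration, and the printed width assumes a 32-bit `int`, as on the systems discussed in this thread.

```c
#include <assert.h>
#include <stdio.h>

/* Widening a signed byte copies the sign bit into all the upper bits. */
unsigned int widen_signed(signed char c)
{
    return (unsigned int)c;
}

/* Widening an unsigned byte fills the upper bits with zeros. */
unsigned int widen_unsigned(unsigned char c)
{
    return (unsigned int)c;
}

/* Print the same byte both ways, similar to the "%x" loop in the post. */
void show_byte(unsigned char byte)
{
    /* Zero extension never sets any bit above the low eight. */
    assert(widen_unsigned(byte) <= 0xffu);

    /* The cast to signed char for values >= 0x80 is implementation-defined;
     * on common two's-complement platforms 0xff becomes -1. */
    printf("as signed char:   0x%x\n", widen_signed((signed char)byte));
    printf("as unsigned char: 0x%x\n", widen_unsigned(byte));
}
```

With a 32-bit `int`, a signed byte holding -1 (bit pattern 0xff) widens to 0xffffffff, while the unsigned byte 0xff stays 0xff. Values below 0x80 come out the same either way because their sign bit is off.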

**maestro371** · 04-13-2008

Originally Posted by Todd Burch

No, good guess, but wrong.

A byte is represented by 8 bits, or, an "octet". 8 bits is 2^8, or, 256, which is 256 values represented by 0-255.

A short int on modern computers is usually 2 bytes, aka 16 bits, and called a "short" (a "halfword" on some platforms) and is 2^16, or 65,536, which is 65,536 values or 0-65,535. As you can see, decimal 128 fits quite easily into a short.

In a byte, when the sign bit is on (the 0x80 bit), and the value is moved into a larger field (a short int or an int or long long), the sign bit is extended to the left. To keep the sign bit from being extended, you have to use unsigned values.

Got it. Thanks for the correction.
