Thread: A Simple Compiler/Linker

  1. #1
    Registered /usr
    Join Date
    Aug 2001
    Location
    Newport, South Wales, UK
    Posts
    1,273

    A Simple Compiler/Linker

    Hello,

    Me being the inquisitive sort that I am, I've become interested in compiler/linker development. It's not a new disease; I've had it for a couple of years now, ever since I worked on a Flash-esque presentation engine and wanted to be able to make standalone programs from the data.

    Here's my first effort, it's pretty simple: give it a string and it will spit out a DOS program called "compiled.exe" that displays the string. Not sure if the header is 100% correct, but it works for me. Anyway...
    Code:
    #include <stdio.h>
    #include <string.h>
    
    #define CODE_SIZE	25
    
    typedef struct {
    	unsigned short usMZ;				// This is always = 'ZM'
    	unsigned short usLastPageSize;		// Last page size in bytes
    	unsigned short usNumPages;			// Number of 512 byte pages
    	unsigned short usNumRelocations;	// ?
    	unsigned short usHeaderSize;		// Size of header in paragraphs (16 bytes)
    	unsigned short usMinMemory;			// Minimum memory in addition to code
    	unsigned short usMaxMemory;			// Maximum memory in addition to code
    	unsigned short usInitialStackSeg;	// Initial stack segment
    	unsigned short usInitialStackOffset;// Initial stack offset
    	unsigned short usChecksum;			// 0
    	unsigned long ulEntryPoint;			// CS:IP
    	unsigned short usRelocTableOffset;	// ?
    	unsigned short usOverlayNumber;		// 0
    } dosheader;
    
    int main(void)
    {
    	char szBuffer[BUFSIZ], *newline;
    	int i;
    	FILE *fp;
    	dosheader header;
    
    	printf("Enter text to be printed on execution: ");
    	fgets(szBuffer, BUFSIZ, stdin);
    	newline = strchr(szBuffer, '\n');	// fgets keeps the newline, so strip it
    	if (newline)
    		*newline = 0;
    
    	header.usMZ = 'ZM';
    	header.usLastPageSize = CODE_SIZE + strlen(szBuffer) + 1;
    	header.usNumPages = 1;
    	header.usNumRelocations = 0;
    	header.usHeaderSize = 2;
    	header.usMinMemory = 0;
    	header.usMaxMemory = 0xFFFF;
    	header.usInitialStackSeg = 0;
    	header.usInitialStackOffset = 0;
    	header.usChecksum = 0;
    	header.ulEntryPoint = 0;
    	header.usRelocTableOffset = 0;
    	header.usOverlayNumber = 0;
    	fp = fopen("compiled.exe", "wb");
    	if (!fp)
    		return -1;
    
    	fwrite(&header.usMZ, 2, 1, fp);
    	fwrite(&header.usLastPageSize, 2, 1, fp);
    	fwrite(&header.usNumPages, 2, 1, fp);
    	fwrite(&header.usNumRelocations, 2, 1, fp);
    	fwrite(&header.usHeaderSize, 2, 1, fp);
    	fwrite(&header.usMinMemory, 2, 1, fp);
    	fwrite(&header.usMaxMemory, 2, 1, fp);
    	fwrite(&header.usInitialStackSeg, 2, 1, fp);
    	fwrite(&header.usInitialStackOffset, 2, 1, fp);
    	fwrite(&header.usChecksum, 2, 1, fp);
    	fwrite(&header.ulEntryPoint, 4, 1, fp);
    	fwrite(&header.usRelocTableOffset, 2, 1, fp);
    	fwrite(&header.usOverlayNumber, 2, 1, fp);
    	for (i=0;i<4;i++)	// Pad out header
    		putc(0, fp);
    
    	// Right, there's no stack, so we're doing this the hard way...
    	putc(0x8C, fp);		// mov ax, cs
    	putc(0xC8, fp);
    	putc(0x8E, fp);		// mov ds, ax
    	putc(0xD8, fp);
    	putc(0xBE, fp);		// mov si, offset message
    	putc(CODE_SIZE, fp);
    	putc(0, fp);
    	putc(0xB4, fp);		// mov ah, 0Eh : Teletype
    	putc(0x0E, fp);
    	putc(0x33, fp);		// xor bx, bx
    	putc(0xDB, fp);
    	putc(0xAC, fp);		// lodsb
    	putc(0x3C, fp);		// cmp al, 0
    	putc(0, fp);
    	putc(0x74, fp);		// jz (pos + 4)
    	putc(4, fp);
    	putc(0xCD, fp);		// int 10h (Video BIOS)
    	putc(0x10, fp);
    	putc(0xEB, fp);		// jmp (pos - 9)
    	putc(0xF7, fp);
    	putc(0xB8, fp);		// mov ax, 4C00h : Exit with error code 0
    	putc(0, fp);
    	putc(0x4C, fp);
    	putc(0xCD, fp);		// int 21h (DOS)
    	putc(0x21, fp);
    	fwrite(szBuffer, strlen(szBuffer) + 1, 1, fp);
    	fclose(fp);
    	printf("Compilation successful.\n");
    	return 0;
    }
    There's nothing seriously wrong with it as such, except that if you type in a largish message it'll bomb (due to the number of pages in the program being fixed at 1). Could anyone suggest further improvements to this?
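
    One possible way to lift the one-page limit is to derive both page fields from the total file size instead of hard-coding them. The sketch below assumes, as the MZ format is usually documented, that the page count and the last-page size describe the whole file, including the 32-byte header. The names mz_page_fields, MZ_PAGE_SIZE and MZ_HEADER_SIZE are hypothetical and not part of the program above.

```c
#define MZ_PAGE_SIZE   512ul  /* an MZ "page" is 512 bytes */
#define MZ_HEADER_SIZE 32ul   /* 2 paragraphs, as written by the program above */

/* Hypothetical helper: derive both page fields from the total file size.
 * Assumption: the fields count the whole file, header included. */
static void mz_page_fields(unsigned long file_size,
                           unsigned short *num_pages,
                           unsigned short *last_page_size)
{
    /* Round up to a whole number of 512-byte pages. */
    *num_pages = (unsigned short)((file_size + MZ_PAGE_SIZE - 1) / MZ_PAGE_SIZE);

    /* Bytes used in the final page; 0 conventionally means the page is full. */
    *last_page_size = (unsigned short)(file_size % MZ_PAGE_SIZE);
}
```

    In the program above this could be called as mz_page_fields(MZ_HEADER_SIZE + CODE_SIZE + strlen(szBuffer) + 1, &header.usNumPages, &header.usLastPageSize), replacing the two fixed assignments.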

    I'll do a Windows one next.

  2. #2
    Pokemon Master digdug4life's Avatar
    Join Date
    Jan 2005
    Location
    Mystic Island, NJ
    Posts
    91
    Code:
    int main([B]void[/B])
    {
        char szBuffer[BUFSIZ], *newline;
        int i;
        FILE *fp;
        dosheader header;
    SALEM!
    Verbal Irony >>

    "I love english homework!" When really nobody like english homework.
    -Mrs. Jennifer Lenz (English Teacher)

  3. #3
    Registered User
    Join Date
    Mar 2004
    Posts
    536
    Quote Originally Posted by SMurf
    Hello,


    There's nothing seriously wrong with it as such, except that if you type in a largish message it'll bomb (due to the number of pages in the program being fixed at 1). Could anyone suggest further improvements to this?

    I'll do a Windows one next.
    Hey: that's kinda cute! (Actually, it's really cute.)

    One thing I don't like (multi-byte character constant)
    Code:
      header.usMZ = 'ZM';
    You could have something like this:
    Code:
    /* define char pointer, set equal to usMZ address: */
    
        char *pChar = (char *)&header.usMZ;
    .
    .
    .
        *pChar++ = 'Z';
        *pChar   = 'M';
    Regards,

    Dave

  4. #4
    ---
    Join Date
    May 2004
    Posts
    1,379
    digdug4life:
    He is calling main correctly

  5. #5
    Registered User
    Join Date
    Jan 2003
    Posts
    78
    Quote Originally Posted by Dave Evans
    You could have something like this:
    Code:
    /* define char pointer, set equal to usMZ address: */
    
        char *pChar = (char *)&header.usMZ;
    .
    .
    .
        *pChar++ = 'Z';
        *pChar   = 'M';
    The problem with doing it that way is you have to get it right depending on the platform. Since this is a DOS compiler it's reasonable to assume it will run on an x86 machine ("Little Endian"). That being the case, you have the characters reversed.

    It's also, IMHO, rather ugly code (a pointer and a cast just to stuff two characters into an unsigned short?). You could do it with a lot less code like this...

    Code:
    unsigned short usMZ = ('Z' << 8) + 'M';
    Though, I don't have any problem with the way SMurf did it.

    -Rog

  6. #6
    Registered User
    Join Date
    Mar 2004
    Posts
    536
    Quote Originally Posted by Rog
    The problem with doing it that way is you have to get it right depending on the platform. Since this is a DOS compiler it's reasonable to assume it will run on an x86 machine ("Little Endian"). That being the case, you have the characters reversed.

    It's also, IMHO, rather ugly code (a pointer and a cast just to stuff two characters into an unsigned short?). You could do it with a lot less code like this...

    Code:
    unsigned short usMZ = ('Z' << 8) + 'M';
    Though, I don't have any problem with the way SMurf did it.

    -Rog

    [edit]Note editing: this post originally had incomplete code [/edit]

    In the first place, my code in the previous post is in error: It puts 'Z', 'M' in the first two positions of the file, and the .exe file "magic number" has 'M' and 'Z' as the first two bytes. Sorry. However, it is immediately obvious from the code what the two bytes are. (Doesn't depend on the compiler that compiles the "compiler" or the machine that the "compiler" was compiled on.) This makes debugging a little easier for me.

    So, change it to
    Code:
        *pChar++ = 'M';
        *pChar   = 'Z';
    This is a system-independent method of putting bytes into a non-char variable in a specified order.

    Your code correctly puts 'M' in the first byte and 'Z' in the second byte if and only if the "compiler" is compiled on a little-endian machine.

    The code in the original program (using multi-character character constant) compiles differently for different compilers on my Windows XP machine.

    I tried this with Borland, Microsoft and GNU compilers:

    Code:
    #include <stdio.h>
    int main()
    {
      unsigned short x;
      char *y;
    
      x = 'AB';
    
      y = (char *)&x;
    
      printf ("x = 0x%04x\n", x);
    
      printf("y[0] = %c, y[1] = %c\n", y[0], y[1]);
    
      return 0;
    }
    Borland gave this result (indicating that the bytes in the char constant were not swapped):
    x = 0x4241
    y[0] = A, y[1] = B
    Microsoft and GNU gave this (indicating that the bytes in the char constant were swapped):
    x = 0x4142
    y[0] = B, y[1] = A
    (GNU gave a compile-time warning: warning: multi-character character constant; the others were silent.)

    That's why I said that I didn't like the original program construct.


    Now, as to whether my code is uglier than your shift routine: I say, "Ugly is as ugly does." Your code works on little-endian machines. My code doesn't depend on endianness. I don't think your code is necessarily ugly: if it works for you --- OK.

    However, this is a learning exercise, and I think programmers should always be aware of system-dependencies (such as endianness) and implementation-dependencies (value of integer multi character constants is implementation-defined).

    [edit]
    Other system dependencies (which I have not addressed here) include assumptions of sizes of unsigned short (16 bits) and unsigned long (32 bits).
    [/edit]

    Think of releasing your code for a cross-compiler under open-source licensing: someone is going to compile the cross-compiler with a different compiler than you used, and/or on a different-endian machine.
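
    As a rough illustration of that concern, one way to sidestep both the endianness and the size assumptions is to emit every header field one byte at a time, low byte first, instead of calling fwrite on the in-memory variable. This is only a sketch: put_le16 and put_le32 are hypothetical helper names, not functions from the original program or from any library.

```c
#include <stdio.h>

/* Hypothetical helper: write the low 16 bits of value, least-significant
 * byte first. The bytes come from arithmetic on the value, so the result
 * does not depend on how the host stores integers in memory. */
static int put_le16(unsigned int value, FILE *fp)
{
    if (putc((int)(value & 0xFFu), fp) == EOF)
        return -1;
    if (putc((int)((value >> 8) & 0xFFu), fp) == EOF)
        return -1;
    return 0;
}

/* Hypothetical helper: write the low 32 bits of value as two 16-bit halves,
 * low half first. Works whether unsigned long is 32 or 64 bits wide. */
static int put_le32(unsigned long value, FILE *fp)
{
    if (put_le16((unsigned int)(value & 0xFFFFul), fp) != 0)
        return -1;
    return put_le16((unsigned int)((value >> 16) & 0xFFFFul), fp);
}
```

    With helpers like these, put_le16(0x5A4D, fp) would place 'M' and then 'Z' in the file on an ASCII system, whichever compiler or machine builds the generator.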

    Regards,

    Dave
    Last edited by Dave Evans; 02-17-2005 at 12:19 PM.

  7. #7
    Registered /usr
    Join Date
    Aug 2001
    Location
    Newport, South Wales, UK
    Posts
    1,273
    I think it's against the rules to dig out "old" posts, a mod may come and lock this, but seeing as you peeps did actually do me a favour and reply, I'll say this:-

    Yes, I know multi-byte character constants are wrong.
    Really it should've either been defined as a two char array, being careful not to use string functions, or even as two separate chars.
    It was just a quick fixup in Visual C++, it didn't mind.
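
    The two-char-array idea could look something like the sketch below. The names mz_signature and write_mz_signature are hypothetical; the array has exactly two elements and no terminating NUL, which is why string functions must stay away from it.

```c
#include <stdio.h>

/* The signature as two explicit bytes; deliberately not a C string. */
static const unsigned char mz_signature[2] = { 'M', 'Z' };

/* Hypothetical helper: write the signature in file order.
 * Returns 0 on success, -1 if fewer than two bytes were written. */
static int write_mz_signature(FILE *fp)
{
    size_t written = fwrite(mz_signature, 1, sizeof mz_signature, fp);
    return written == sizeof mz_signature ? 0 : -1;
}
```

    Because the bytes are stored and written in array order, neither the compiler's treatment of multi-character constants nor the host's byte order comes into it.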

    Also it's not really a compiler as it doesn't actually compile any code, it just links a pre-written routine together with a DOS executable header and some data (The string it asks for). Lexical analysis for a language is tricky.

  8. #8
    Registered User
    Join Date
    Mar 2004
    Posts
    536
    Quote Originally Posted by SMurf
    I think it's against the rules to dig out "old" posts, a mod may come and lock this, but seeing as you peeps did actually do me a favour and reply, I'll say this:-

    Yes, I know multi-byte character constants are wrong.
    Really it should've either been defined as a two char array, being careful not to use string functions, or even as two separate chars.
    It was just a quick fixup in Visual C++, it didn't mind.

    Also it's not really a compiler as it doesn't actually compile any code, it just links a pre-written routine together with a DOS executable header and some data (The string it asks for). Lexical analysis for a language is tricky.

    I understand your points.

    Here are my points:

    Lots of people read these threads without ever asking questions or posting anything. That's why sometimes the dialogue continues long after the original poster has been satisfied (or has dozed off, or has gone to some more rewarding career, or ...)

    As a learning tool, I think your project is interesting, since it points up a few places where system- and implementation-dependent code can lead to surprises when the student gets turned loose on the "real world" (man --- how I hate the 'real world'). I think that writing to binary files is a really good place to find out the importance of endianness. You know exactly where the bytes should be and what they should be.

    I dare say that lots of us have put something "quick and slightly dirty" to test a concept, with the idea of doing it the "right way" at some later time. Maybe we cleaned it up, or maybe not. But as a learning tool, I didn't feel right about ignoring a couple of points that I feel are important. (And in this case it doesn't actually add any complexity to the problem to make it a little more robust.)

    Just my USD 0.02

    Regards,

    Dave

    P.S. I like your program. I compiled it and executed it and just sat here with a little grin on my face when it told me, "Hi." Any day that gives me a smile ---even a little one--- is a Good Day. A real nostalgic visit to the olden days, before printf("hello world");



    "The opinions expressed here are not necessarily my own --- It's these dang voices in my head."
    Last edited by Dave Evans; 02-17-2005 at 03:38 PM.

  9. #9
    Registered User
    Join Date
    Jan 2003
    Posts
    78
    Quote Originally Posted by Dave Evans
    However, this is a learning exercise, and I think programmers should always be aware of system-dependencies (such as endianness) and implementation-dependencies (value of integer multi character constants is implementation-defined).
    And that's why I replied when I saw that you had it wrong.

    Your method of using a pointer to char does not make the code portable. Your pointer arithmetic is accomplishing essentially the same thing that my shift operation is.

    I do, however, stand corrected about the multi-character constant. You are correct... they should be avoided.


    -Rog

  10. #10
    Registered User
    Join Date
    Mar 2004
    Posts
    536
    Quote Originally Posted by Rog
    And that's why I replied when I saw that you had it wrong.

    Your method of using a pointer to char does not make the code portable. Your pointer arithmetic is accomplishing essentially the same thing that my shift operation is.

    I do, however, stand corrected about the multi-character constant. You are correct... they should be avoided.


    -Rog
    The code in my original post was wrong because I made a mistake and wrote the wrong bytes.

    If you want to write bytes in certain positions in memory, you can do it with a pointer-to-char. Set the pointer to the address of the first byte position and increment it to write successive bytes. This is portable, in the sense that it does not depend on the endianness of integer storage. The bit-shifting is absolutely, explicitly dependent on the endianness of the implementation.

    In order to make the pointer method absolutely portable, one should use whatever implementation-dependent data type corresponds to eight bits (since, as has been pointed out on this forum, the C standard does not guarantee that a char is always eight bits). I felt that was a little too abstract for now, but it is not negligible.
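
    A minimal sketch of how that eight-bit assumption might be made explicit, so that a build on an unusual platform fails loudly instead of writing a bad file. CHAR_BIT comes from the standard <limits.h>; store_bytes2 is a hypothetical helper name that just wraps the pointer method described above.

```c
#include <limits.h>

/* Refuse to compile where a char is not exactly eight bits wide. */
#if CHAR_BIT != 8
#error "This sketch assumes 8-bit chars"
#endif

/* Hypothetical helper: store two bytes at dst in the order given,
 * using an unsigned char pointer so host byte order plays no part. */
static void store_bytes2(void *dst, unsigned char first, unsigned char second)
{
    unsigned char *p = (unsigned char *)dst;

    *p++ = first;   /* lowest address, so first byte in the file */
    *p   = second;  /* next address, so second byte in the file */
}
```

    For the header signature this would be called as store_bytes2(&header.usMZ, 'M', 'Z'), which still assumes that unsigned short is at least two bytes wide.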

    Regards,

    Dave
