Compiling c++ intrinsics commands


  1. #1
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485

    Compiling c++ intrinsics commands

    Hallo,

    I have started looking at the SSE intrinsics for C++, but I can't get one of the first examples to compile, so I was wondering if anyone could give me a hand?

    Code:
    #include <iostream>
    #include <xmmintrin.h>
    
    using namespace std;
    
    float pArray1[4] = {2.5 , 5.0 , 7.5, 10};
    float pArray2[4] = {3.5 , 6.0 , 8.5, 11};
    float result[4];
    
    
    int main()
    {
    	__m128 m1, m2, m3, m4;
    
    	__m128* a1 = (__m128*)pArray1;
    	__m128* a2 = (__m128*)pArray2;
    	__m128* res = (__m128*)result;
    
    
    	m1 = _mm_mul_ps(*a1, *a1);        // m1 = *a1 * *a1
    	m2 = _mm_mul_ps(*a2, *a2);        // m2 = *a2 * *a2
    	m3 = _mm_add_ps(m1, m2);          // m3 = m1 + m2
    	m4 = _mm_sqrt_ps(m3);             // m4 = sqrt(m3)
    
    	union u
    	{
    		__m128 m;
    		float f[4];
    	} x;
    
    	x.m = m4;
    
    	cout << x.f[0];
    
    	system("PAUSE");
    	return 0;
    }
    Thanks,

  2. #2
    Registered User hk_mp5kpdw's Avatar
    Join Date
    Jan 2002
    Location
    Northern Virginia/Washington DC Metropolitan Area
    Posts
    3,799
    What are the compiler errors you are getting?
    "Owners of dogs will have noticed that, if you provide them with food and water and shelter and affection, they will think you are god. Whereas owners of cats are compelled to realize that, if you provide them with food and water and shelter and affection, they draw the conclusion that they are gods."
    -Christopher Hitchens

  3. #3
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485
    I solved the compiler error, but I now have a runtime error.

    As soon as the program starts I get this error:
    Unhandled exception at 0x0041140e in SSEtest.exe: 0xC0000005: Access violation reading location 0xffffffff.
    And it points to this line:
    m1 = _mm_mul_ps(*a1, *a1); // m1 = *a1 * *a1

  4. #4
    Registered User hk_mp5kpdw's Avatar
    Join Date
    Jan 2002
    Location
    Northern Virginia/Washington DC Metropolitan Area
    Posts
    3,799
    What's the prototype for that function? Maybe you're passing the arguments wrong.

  5. #5
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,662
    Where did you get the code?
    Here? http://www.codeproject.com/KB/recipes/sseintro.aspx

    You don't have "__declspec(align(16))" on your arrays like you should.

    gg

  6. #6
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485
    I somehow missed that.

    Thanks.

    How do I load a single byte of data? I have found how to load a bunch of different things, but what I really need is to load a single byte. Is it even possible?

    The bytes are aligned (or at least I think they are, as they are stored in one huge array)

  7. #7
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,893
    Not possible. Why would you load a single byte into SSE?

    Anyway, you can over-read, or you can read the single byte into a normal register, zero-pad it and move it over.
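
    The two options mentioned above can be sketched as follows, assuming SSE2 is available. The function names are hypothetical; the intrinsics are the standard ones from <emmintrin.h>.

```cpp
#include <emmintrin.h>  // SSE2: __m128i, _mm_cvtsi32_si128, _mm_loadu_si128
#include <cstdint>

// Option 1: read the byte into a normal register, zero-extend it, move it over.
// The byte ends up in the lowest lane; every other bit of the register is 0.
__m128i load_single_byte(const std::uint8_t* p)
{
    return _mm_cvtsi32_si128(static_cast<int>(*p));
}

// Option 2: over-read. Loads p[0..15] with an unaligned load; the caller must
// guarantee that all 16 bytes are readable, even if only p[0] is of interest.
__m128i load_sixteen_bytes(const std::uint8_t* p)
{
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
}
```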
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  8. #8
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485
    The data I want to work on is in bytes (8bit). So the best would be to load 16 separate bytes into each SSE register, as then I would not have to pack/unpack them.

    If I pack 4 bytes into an int and then load it, how would I go about, say, adding one register to another, considering I want the addition to work on each byte, not on packs of four? I think I would have to mask each value off, but would that not defeat the purpose of SSE?
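
    For addition specifically, no packing or masking should be needed: the SSE2 integer intrinsics treat a register as 16 independent 8-bit lanes. A minimal sketch, assuming SSE2 and using a made-up function name; _mm_adds_epu8 saturates at 255, while _mm_add_epi8 would wrap around instead.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>
#include <cstdint>

// Hypothetical helper: dst[i] = min(255, a[i] + b[i]) for n bytes.
void add_bytes_saturated(const std::uint8_t* a, const std::uint8_t* b,
                         std::uint8_t* dst, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // 16 independent unsigned 8-bit additions; each lane clamps at 255
        // instead of carrying into its neighbour.
        __m128i sum = _mm_adds_epu8(va, vb);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), sum);
    }
    for (; i < n; ++i) {  // scalar tail for the last n % 16 bytes
        unsigned s = static_cast<unsigned>(a[i]) + b[i];
        dst[i] = static_cast<std::uint8_t>(s > 255 ? 255 : s);
    }
}
```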

  9. #9
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Can you describe a bit what you are actually trying to achieve? It does, like you imply, seem like this is a poor match for using SSE. SSE works best if you have nicely aligned (to 16 bytes) data that is already in suitable lumps of four 32-bit values (or two 64-bit values).

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  10. #10
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485
    I am still working on the software blitting that I have created threads for before.

    I have an array of bytes which is the texture I am trying to blit to the screen, and one array of bytes for the screen.

    The data is stored like this, but there is no problem rearranging this into something else.
    textureData[0] <-- pixel_1 Blue
    textureData[1] <-- pixel_1 Green
    textureData[2] <-- pixel_1 Red
    textureData[3] <-- pixel_1 Alpha
    textureData[4] <-- pixel_2 Blue

    Where each data is one byte (8bit)

    The blitting formula is this, where I have to run the equation for each of the colours in a pixel.
    result = ALPHA * ( srcPixel - destPixel ) + destPixel
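
    As a point of reference, the formula applied to the interleaved BGRA layout above might look like this plain scalar sketch. It assumes ALPHA is the texture's alpha byte scaled from 0..255 to 0..1; the function name and signature are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference for interleaved BGRA data, 4 bytes per pixel.
// The screen's own alpha byte is left untouched.
void blend_scalar(const std::uint8_t* texture, std::uint8_t* screen,
                  std::size_t pixelCount)
{
    for (std::size_t p = 0; p < pixelCount; ++p) {
        const std::uint8_t* src = texture + 4 * p;
        std::uint8_t* dst = screen + 4 * p;
        const int alpha = src[3];                 // 0..255
        for (int c = 0; c < 3; ++c) {             // blue, green, red
            // result = ALPHA * (srcPixel - destPixel) + destPixel,
            // with the division by 255 standing in for the 0..1 scaling.
            const int diff = int(src[c]) - int(dst[c]);
            dst[c] = static_cast<std::uint8_t>(dst[c] + alpha * diff / 255);
        }
    }
}
```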


    What I was planning first was to convert the data so that I pack four bytes into an int (32-bit), so instead of storing each colour of each pixel as a separate element I would just store each pixel.

    Then I would load 4 texture pixels into each of the first four SSE registers (4 * 32 = 128)
    and load 4 screen pixels into each of the last four registers.

    But then I started wondering how I would do the blitting calculation, as I need to work with each colour, not the entire pixel as one. From what I understand, if I start masking out the values that I need, I will lose the benefits of SSE, rendering my new code as slow as the old blitting code.

    So then I decided that it would be better to load all the red values into the first register, all blues into the second and so on, as then I could work on them the way I wanted. I am not even sure it's a smart idea, but I'm new to this and for now, that's the idea that makes the most sense to me. But in order to achieve that I need to load 16 separate bytes into each register, which I am not sure is possible.

    I think it is possible to make a new formula that would work on a per-pixel basis (not colour by colour), but I'm not sure how to.

    Please ask if any of this does not make any sense.

    Thanks for reading

  11. #11
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Hmm. I'd say that you do need to split the R, G, B into separate portions and do the operation on each (with a "saturating" operation, most likely).

    I take it that you can't use (or haven't got) graphics hardware to perform the blit itself? That would probably be the ideal. Even if you don't want to display the data, you could simply use off-screen blit operations to perform the combining of the texture and background image.


    Preprocessing the data to store the R, G, B (and A) values in separate arrays will help - at least textures can (I guess) have this done once [that is, unless the texture changes as your application runs]. It will probably also make sense to form floating point values from the byte values [this will take up 4x the space, but reduces the need for a conversion later]. You can just divide each byte value by 255 to get a 0.0 .. 1.0 value range, that makes all the calculations work better.

    --
    Mats

  12. #12
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485
    Thanks for your input matsp

    If it's possible, I would like to keep the data as bytes, as then I can do 4 times as many pixels in one go. But the conversion (at run time) will probably take up some amount of time (I read that float to int/int to float is very, very slow).

    Is it possible to do it without using floats?

    Say I have this setup:
    Code:
    __m128 textureRed;
    __m128 textureBlue;
    __m128 textureGreen;
    __m128 textureAlpha;
    
    __m128 screenRed;
    __m128 screenBlue;
    __m128 screenGreen;
    __m128 temp;
    
    __m128 newData = _mm_mul_ps(screenRed, textureRed);   // newData = screenRed * textureRed
    As there are only 8 registers, will newData cause a problem with performance?

  13. #13
    C++まいる!Cをこわせ! Elysia's Avatar
    Join Date
    Oct 2007
    Posts
    22,546
    Using SSE, you'd want to combine 4 pixels or so and process them in one go.
    Processing them as bytes, one by one, is by far slower.
    That's about all I know.

  14. #14
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by h3ro View Post
    >>Thanks for your input matsp
    >>
    >>If it's possible, I would like to keep the data as bytes, as then I can do 4 times as many pixels in one go. But the conversion (at run time) will probably take up some amount of time (I read that float to int/int to float is very, very slow)

    How do you propose to do 4 bytes in one 32-bit integer? SSE doesn't have a bytewise integer multiply (with or without saturation). The narrowest packed integer multiply in SSE2 works on 16-bit values, eight at a time, so the bytes have to be widened before they can be multiplied.

    If you stuff 4 bytes into one 32-bit integer and multiply it, any overflow will spill into the neighbouring bytes and corrupt values other than the one you wanted to change, which is definitely not a good plan.

    >>Is it possible to do it without using floats?

    Sure.

    >>Say I have this setup:
    >>Code:
    >>__m128 textureRed;
    >>__m128 textureBlue;
    >>__m128 textureGreen;
    >>__m128 textureAlpha;
    >>
    >>__m128 screenRed;
    >>__m128 screenBlue;
    >>__m128 screenGreen;
    >>__m128 temp;
    >>
    >>__m128 newData = _mm_mul_ps(screenRed, textureRed);   // newData = screenRed * textureRed
    >>As there are only 8 registers, will newData cause a problem with performance?

    The newData = screenRed * textureRed will use at most three registers in itself. Now, having red, green, blue, alpha from the texture and red, green, blue from the screen will require 7 different values, with another 3 values as the results for r, g, b, which would mean 10 values in total. But not all of those need to be live at the same time, so assuming the compiler is half-way decent, it should be possible to use the 8 existing registers without too much overhead of filling and spilling (storing/retrieving from memory).
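
    One way to do the blend without floats is sketched below, under some assumptions: SSE2 is available, each colour channel is stored in its own byte array as suggested earlier, and alpha is a byte in 0..255. The names are hypothetical. The formula is rearranged to src * alpha + dst * (255 - alpha) so that every intermediate fits in an unsigned 16-bit lane, using the packed 16-bit multiply _mm_mullo_epi16.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Exact x / 255 with rounding, for 16-bit lanes holding 0 <= x <= 255 * 255.
static inline __m128i div255_epu16(__m128i x)
{
    __m128i t = _mm_add_epi16(x, _mm_set1_epi16(128));
    return _mm_srli_epi16(_mm_add_epi16(t, _mm_srli_epi16(t, 8)), 8);
}

// Hypothetical helper: blend 16 bytes of one colour channel in place.
// dst = (src * alpha + dst * (255 - alpha)) / 255
void blend16_bytes(const std::uint8_t* src, const std::uint8_t* alpha,
                   std::uint8_t* dst)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i c255 = _mm_set1_epi16(255);

    __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(alpha));
    __m128i d = _mm_loadu_si128(reinterpret_cast<const __m128i*>(dst));

    // Widen each half from 8-bit to 16-bit lanes by interleaving with zero.
    __m128i sLo = _mm_unpacklo_epi8(s, zero), sHi = _mm_unpackhi_epi8(s, zero);
    __m128i aLo = _mm_unpacklo_epi8(a, zero), aHi = _mm_unpackhi_epi8(a, zero);
    __m128i dLo = _mm_unpacklo_epi8(d, zero), dHi = _mm_unpackhi_epi8(d, zero);

    // The sum is at most 255 * 255 = 65025, so it cannot wrap in 16 bits.
    __m128i lo = _mm_add_epi16(_mm_mullo_epi16(sLo, aLo),
                               _mm_mullo_epi16(dLo, _mm_sub_epi16(c255, aLo)));
    __m128i hi = _mm_add_epi16(_mm_mullo_epi16(sHi, aHi),
                               _mm_mullo_epi16(dHi, _mm_sub_epi16(c255, aHi)));

    // Narrow back to bytes; the quotients are already within 0..255.
    __m128i r = _mm_packus_epi16(div255_epu16(lo), div255_epu16(hi));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), r);
}
```

    This handles 16 values of one channel per call, compared with 4 per register in a float version, at the cost of the widening and narrowing steps.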

    --
    Mats

  15. #15
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485
    Thanks again Mats, I'm starting to get the picture now. I even have something working.

    One last question:
    >>You can just divide each byte value by 255 to get a 0.0 .. 1.0 value range, that makes all the calculations work better.
    Why would 0.0 to 1.0 be better to work with?

    Also, for the conversions, I assume it's better to use the SSE2 commands for going from float to byte/int than simply casting in normal C++?
