Thread: how to find the longest meaning substring?

  1. #1
    Registered User
    Join Date
    May 2011
    Posts
    5

    how to find the longest meaning substring?

    I learned C abt 4 years ago, and havn't touched it since then. Recently, I need to program in C to find the longest meaning substring of a string without white space, given a sorted dictionary text file.

    For example, I've a string "europeaninvestmentbank", I first want to identify "european" "investment" "bank", among other possible meaningful substrings, and then find the length of longest meaningful substring is 10 for "investment".

    Is there any existing solutions in C?
    Or I need to implement from scratch?

    Thanks a million!

  2. #2
    Registered User Inanna's Avatar
    Join Date
    May 2011
    Posts
    69
    I always start with the easiest solution because I am not very smart. The easiest solution is strstr() on the whole search string starting with the longest string from the dictionary:
    Code:
    char** sorted_dict = dictionary_by_length();
    
    while (sorted_dict)
    {
        if (strstr(source, *sorted_dict))
        {
            printf("Longest meaningful substring: '%s'\n", *sorted_dict);
            break;
        }
    }
    dictionary_by_length() is a placeholder for whatever method you use to get the dictionary strings from longest to shortest.

  3. #3
    Registered User
    Join Date
    May 2011
    Posts
    5
    Thanks very much Inanna, I think you are pointing me to the right direction. strstr() is indeed handy.

    I'm not sure about using strstr directly as "strstr(source, *sorted_dict)", as it would return only the occurrence of complete "source" in dictionary, and what I want is the substrings of source.

    As in the previous example, in the case of "europeaninvestmentbank" as source, I want to spot substring "investment" as the longest meaningful substring...

    I'm trying to compare each word in my dictionary with every possible substring of my source, smartly (pfffff, don't know how exactly yet), with strstr(substring, word).

    And currently got stuck at the very first step: while importing dictionary text into a 2d array of char. realloc() gave invalid next size error from now and then! No clue what's happening.





    Quote Originally Posted by Inanna View Post
    I always start with the easiest solution because I am not very smart. The easiest solution is strstr() on the whole search string starting with the longest string from the dictionary:
    Code:
    char** sorted_dict = dictionary_by_length();
    
    while (sorted_dict)
    {
        if (strstr(source, *sorted_dict))
        {
            printf("Longest meaningful substring: '%s'\n", *sorted_dict);
            break;
        }
    }
    dictionary_by_length() is a placeholder for whatever method you use to get the dictionary strings from longest to shortest.

  4. #4
    Registered User Inanna's Avatar
    Join Date
    May 2011
    Posts
    69
    I'm not sure about using strstr directly as "strstr(source, *sorted_dict)", as it would return only the occurrence of complete "source" in dictionary, and what I want is the substrings of source.
    strstr("europeaninvestmentbank", "investment") will return &"europeaninvestmentbank"[8] because that is where the substring was found. It will return NULL if the substring was not found. Assuming you have the complete source string in memory and only search for the longest substrings first, strstr() will do what you are asking for.
    Code:
    #include <stdio.h>
    #include <string.h>
    
    // Ordered by length of the string
    char const* dictionary[] =
    {
        "investment",
        "european",
        "bank",
        NULL
    };
    
    int main()
    {
        char const* source = "europeaninvestmentbank";
        char const** sorted_dict = dictionary;
    
        while (sorted_dict)
        {
            if (strstr(source, *sorted_dict))
            {
                printf("Longest meaningful substring: '%s'\n", *sorted_dict);
                break;
            }
    
            ++sorted_dict;
        }
    }
    And currently got stuck at the very first step: while importing dictionary text into a 2d array of char. realloc() gave invalid next size error from now and then! No clue what's happening.
    Any kind of buffer overflow on a pointer to heap memory can mess up the heap and cause weird errors. Try this code on your dictionary file and see if the same thing happens:
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    #define TEST
    
    int length_order(void const* lhs, void const* rhs);
    
    int main()
    {
        FILE* fp = fopen("test.txt", "r");
        char** dictionary = NULL;
        char word[1024];
        size_t size = 0;
    
        // Populate the dictionary
        while (fscanf(fp, "%1023s", word) == 1)
        {
            // Grow dictionary by 1 each time and assume allocations succeed for simplicity
            dictionary = (char**)realloc(dictionary, ++size * sizeof(char**));
            dictionary[size-1] = (char*)malloc(strlen(word)+1);
            strcpy(dictionary[size-1], word);
        }
    
        fclose(fp);
    
        // Sort the dictionary by string length
        qsort(dictionary, size, sizeof(char*), length_order);
    
    #if defined(TEST)
        // Dump the dictionary for testing
        for (size_t x = 0; x < size; ++x)
        {
            puts(dictionary[x]);
        }
    #endif
    
        // Clean up the dictionary
        for (size_t x = 0; x < size; ++x)
        {
            free(dictionary[x]);
        }
    
        free(dictionary);
    }
    
    int length_order(void const* lhs, void const* rhs)
    {
        return strlen(*(char const**)rhs) - strlen(*(char const**)lhs);
    }

  5. #5
    Registered User
    Join Date
    May 2011
    Posts
    5
    Hi Inanna
    It works now!
    First, I define my dictionary data structure as the following, words are stored in an 2d array of char called word_store. each array holds words of the same length in sorted order. For example, *word_store[1] would be "amasat". counts is an array of int which keeps track of how many words has been stored in each char array in word_store.
    Code:
    struct dictionary{
    	char *word_store[MAX_WORD_LEN];
    	int counts[MAX_WORD_LEN];
    } my_dic;
    For LMS searching, I tried strstr on "europeaninvestmentbank" and all words in word_store[21], if no occurrence returned, I would move on to search in word_store[20] ... until a first match is found.

    Basically the same at your proposal except that I try to match word of the same length or shorter than searching target.
    The algorithm is not like a rocket science, but dynamic memory allocation and initialization is a real pain in the ass. Still feel like there should be a better way to do it.

    Thank you!

  6. #6
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    Way to spam with the cross-posts!

    C - how to find the longest meaning substring?
    How to find the longest meaning substring? - Dev Shed

    Though I'm betting the TT one will be gone real soon now.

    Now read up on how to choose a forum
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  7. #7
    Registered User
    Join Date
    May 2011
    Posts
    5
    Don't see why it's a problem to ask the same question in more than one forums, the thing I'm saying might interest people in other forums as well doesn't it?

    I'm really speechless to see my post got deleted from TT forum, are forums like exclusive clubs, you can only be a member of one of those?!

    Is your spirit of open source only apply within your forum?!

    Quote Originally Posted by Salem View Post
    Way to spam with the cross-posts!

    C - how to find the longest meaning substring?
    How to find the longest meaning substring? - Dev Shed

    Though I'm betting the TT one will be gone real soon now.

    Now read up on how to choose a forum

  8. #8
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    It's like ordering a pizza from 3 different places, but only acknowledging and paying the guy who shows up first with your pizza. Those other two restaurants are out the time and money on the pizza you ordered, but didn't accept or pay for, all because you're impatient. Pretty rude and selfish from the pizza company's point of view.

    Read the "how to choose a forum" link Salem posted, it explains the issue in more detail.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. How to find the longest word in text file?
    By alionas in forum C Programming
    Replies: 7
    Last Post: 03-07-2011, 01:05 PM
  2. longest common substring problem
    By Enchanter88 in forum C++ Programming
    Replies: 4
    Last Post: 09-29-2007, 11:02 AM
  3. How find a substring in struct
    By nick048 in forum C Programming
    Replies: 5
    Last Post: 04-07-2007, 10:03 AM
  4. longest common substring
    By TechHigh in forum C Programming
    Replies: 6
    Last Post: 01-05-2007, 03:18 AM
  5. Replies: 11
    Last Post: 12-11-2005, 11:50 PM