Thread: text processing - huge amount of data

  1. #1
    Registered User
    Join Date
    Nov 2003
    Posts
    183

    text processing - huge amount of data

    Hello,

    I need some advice please.

    I have to extract 70000 new words from some corpus and add them to a database which already has 30000 words.

    I don't know which approach I should take so that I avoid (1) memory problems and (2) lots of checking inside loops.

    I thought a good way might be to use a "set" (a set cannot contain repeated items), but I am not sure whether C# has that data structure.

    I would appreciate any suggestions and help.
    Thank you in advance
    Arian

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    I suggest that you use SQLite or something similar. This way, you can put a unique index on the word column and let the database engine reject duplicates, basically offloading the hard work to the database.

  3. #3
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    The HashSet<T> type can certainly assist you with that. Duplicate additions are automatically discarded (Add simply returns false), so you can load the initial 30k words into the set and then add the next 70k without manual checks. It internally uses a hash table (couldn't have guessed that from the name, right?), so duplicate checks are very quick.
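
    A minimal sketch of that idea, not a definitive implementation. The names WordStore and AddNewWords and the tiny sample arrays are made up for illustration, and the case-insensitive comparer is an assumption about how you want to compare words (drop it if "Apple" and "apple" should count as different words):

```csharp
using System;
using System.Collections.Generic;

public static class WordStore
{
    // Adds each candidate to the set and returns how many were actually new.
    // HashSet<T>.Add returns false (and changes nothing) when the item is already present.
    public static int AddNewWords(HashSet<string> words, IEnumerable<string> candidates)
    {
        int added = 0;
        foreach (string word in candidates)
        {
            if (words.Add(word))
                added++;
        }
        return added;
    }

    public static void Main()
    {
        // Stand-ins for the 30,000 existing words and the words extracted from the corpus.
        var existing = new[] { "apple", "banana", "cherry" };
        var extracted = new[] { "banana", "date", "Apple", "date", "elderberry" };

        // Assumption: OrdinalIgnoreCase treats "Apple" and "apple" as the same word.
        var words = new HashSet<string>(existing, StringComparer.OrdinalIgnoreCase);

        int added = AddNewWords(words, extracted);
        Console.WriteLine($"Added {added} new words, {words.Count} in total.");
        // prints: Added 2 new words, 5 in total.
    }
}
```

    In a real program you would fill the set from the database first and write back only the words for which Add returned true.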

    As far as memory is concerned, around 100,000 words definitely should not be a problem.

    The only caveat is that order is not maintained. The final set will contain each word exactly once, but if you iterate over it, the words can come out in a seemingly random order.
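
    If the order does matter, here are two possible workarounds, sketched under assumptions: OrderedWords, Sorted and DistinctInOrder are hypothetical names, and plain ordinal (byte-wise) sorting is assumed to be good enough for your words.

```csharp
using System;
using System.Collections.Generic;

public static class OrderedWords
{
    // Returns the words in ordinal alphabetical order; the set itself is left untouched.
    public static List<string> Sorted(HashSet<string> words)
    {
        var list = new List<string>(words);
        list.Sort(StringComparer.Ordinal);
        return list;
    }

    // Removes duplicates but keeps the order of first appearance.
    public static List<string> DistinctInOrder(IEnumerable<string> source)
    {
        var seen = new HashSet<string>();
        var result = new List<string>();
        foreach (string word in source)
        {
            if (seen.Add(word))   // true only the first time a word is seen
                result.Add(word);
        }
        return result;
    }

    public static void Main()
    {
        var corpus = new[] { "pear", "fig", "pear", "apple", "fig" };
        Console.WriteLine(string.Join(", ", DistinctInOrder(corpus)));
        // prints: pear, fig, apple
        Console.WriteLine(string.Join(", ", Sorted(new HashSet<string>(corpus))));
        // prints: apple, fig, pear
    }
}
```

    The second method costs an extra list alongside the set, which is still small at this scale.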
    Last edited by itsme86; 08-19-2015 at 09:47 AM.

  4. #4
    Registered User
    Join Date
    Nov 2003
    Posts
    183
    Thank you very much laserlight and itsme86
