randomly picking a word out of a text file

**lilrayray** · 08-01-2006

I have basic knowledge of file I/O and random functions. For a current project I would like to open a file, pick a word at random out of the text file, and assign it to a variable. I can't quite figure out how to do this. Any hints/help is greatly appreciated.

**quzah** · 08-01-2006

Find the size of the file. fseek to a random spot. If it's a space, or "not a word" token, skip ahead or back, randomly if you like, till you hit a word, and use it. If it's not a space or "not a word token", then take the word you're on. (Back up to the start of the word, read the word into a variable.)

There's one way to do it.
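
A minimal sketch of this approach, assuming words are separated by whitespace and the file was opened in binary mode (`"rb"`) so byte offsets are reliable. The function name `random_word_by_seek` and the buffer handling are made up for illustration. If the random spot is whitespace, this version only skips ahead (wrapping to the start of the file if it runs off the end).

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: jump to a random byte, find the word there,
 * and copy it into buf. Returns 1 on success, 0 if no word was found.
 * Longer words cover more bytes, so they are picked more often. */
int random_word_by_seek(FILE *fp, char *buf, size_t bufsize)
{
    long size, pos;
    size_t n = 0;
    int c;

    if (bufsize < 2 || fseek(fp, 0L, SEEK_END) != 0)
        return 0;
    size = ftell(fp);
    if (size <= 0)
        return 0;

    /* Sketch only: rand() cannot reach offsets beyond RAND_MAX. */
    pos = rand() % size;

    fseek(fp, pos, SEEK_SET);
    c = fgetc(fp);
    if (c != EOF && !isspace(c)) {
        /* Inside a word: back up to its first character. */
        while (pos > 0) {
            fseek(fp, pos - 1, SEEK_SET);
            c = fgetc(fp);
            if (isspace(c))
                break;
            pos--;
        }
    }

    /* From pos, skip any whitespace to reach the start of a word. */
    fseek(fp, pos, SEEK_SET);
    while ((c = fgetc(fp)) != EOF && isspace(c))
        ;
    if (c == EOF) {
        /* Landed in trailing whitespace: wrap around to the first word. */
        rewind(fp);
        while ((c = fgetc(fp)) != EOF && isspace(c))
            ;
    }

    /* Copy the word, truncating if it does not fit in buf. */
    while (c != EOF && !isspace(c)) {
        if (n < bufsize - 1)
            buf[n++] = (char)c;
        c = fgetc(fp);
    }
    buf[n] = '\0';
    return n > 0;
}
```

Call `srand()` once at program start, then something like `random_word_by_seek(fp, word, sizeof word)` each time a word is needed.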

Quzah.

**itsme86** · 08-01-2006

Count the words in the file, generate a random number within those bounds, seek to that word, read it in.
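
Roughly, that could look like the sketch below. It assumes whitespace-separated words and uses a made-up function name, `random_word_by_count`; the 63-character limit is an arbitrary choice for the example, and longer words get split by the `%63s` width.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: two passes over the file.
 * Pass 1 counts the words; pass 2 rewinds and reads up to the
 * randomly chosen one. Returns 1 on success, 0 if the file has no words. */
int random_word_by_count(FILE *fp, char buf[64])
{
    long count = 0, target, i;

    rewind(fp);
    while (fscanf(fp, "%63s", buf) == 1)
        count++;
    if (count == 0)
        return 0;

    target = rand() % count;   /* index in the range 0 .. count-1 */

    rewind(fp);
    for (i = 0; i <= target; i++)
        if (fscanf(fp, "%63s", buf) != 1)
            return 0;
    return 1;                  /* buf now holds word number target */
}
```

Every occurrence of a word counts as one entry, so each position in the file is equally likely regardless of word length.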

**itsme86** · 08-01-2006

Originally Posted by quzah

> Find the size of the file. fseek to a random spot. If it's a space, or "not a word" token, skip ahead or back, randomly if you like, till you hit a word, and use it. If it's not a space or "not a word token", then take the word you're on. (Back up to the start of the word, read the word into a variable.)
>
> There's one way to do it.
>
> Quzah.

But that's biased towards longer words.

**quzah** · 08-01-2006

All words are not created equal.

Quzah.

**Wraithan** · 08-01-2006

File
Array of Strings
Random number within array bounds
Random word
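
One way those four steps might look in C, as a sketch only. The names `load_words` and `free_words` are invented for this example, words are assumed to be whitespace-separated, and anything over 63 characters is split.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read every word from the current file position
 * into a growing array of strings. Stores the number of words in *count.
 * Returns NULL if the file has no words or an allocation fails. */
char **load_words(FILE *fp, size_t *count)
{
    char **words = NULL, **tmp;
    char buf[64];
    size_t n = 0, cap = 0, i;

    while (fscanf(fp, "%63s", buf) == 1) {
        if (n == cap) {                      /* grow the pointer array */
            cap = cap ? cap * 2 : 16;
            tmp = realloc(words, cap * sizeof *words);
            if (tmp == NULL)
                goto fail;
            words = tmp;
        }
        words[n] = malloc(strlen(buf) + 1);  /* one allocation per word */
        if (words[n] == NULL)
            goto fail;
        strcpy(words[n], buf);
        n++;
    }
    *count = n;
    return words;

fail:
    for (i = 0; i < n; i++)
        free(words[i]);
    free(words);
    *count = 0;
    return NULL;
}

/* Release everything load_words allocated. */
void free_words(char **words, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        free(words[i]);
    free(words);
}
```

Once loaded, a random word is just `words[rand() % count]`, with no further file access.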

**itsme86** · 08-01-2006

As you can see, there are lots of ways, and which one is best depends on the circumstances. Quzah's is extremely efficient and works well if the words are all the same length. Mine is a works-for-all-circumstances kind of thing. Wraithan's works well but uses a lot more memory, so if the file is small his method will work fine and outperform mine.

**TriKri** · 08-01-2006

quzah's way is the fastest, and it is biased towards longer words and words that occur more often. itsme86's way is biased only towards words that occur more often. It all depends on how random you want it to be and what you want "random" to mean.

**Wraithan** · 08-01-2006

It does use a lot more memory, but even a file with thousands of words doesn't take up much room... the .dic file that comes with Crimson Editor (used for spell check) is only about 950 KB, under a megabyte, and has the majority of everyday words. If we want to look at it from another perspective, yours takes more processor time.

But there are even more ways to do this... try doing a search on Google or on this forum, since this question has been asked like a million times...
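
One such alternative, which nobody in this thread spells out, is reservoir sampling: read the file once and replace the current pick with the k-th word with probability 1/k. The sketch below is only an illustration; the function name `random_word_one_pass` is made up, and words are again assumed to be whitespace-separated and at most 63 characters.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: pick one word in a single pass, without knowing
 * the word count in advance and without storing the words.
 * Reads from the current file position. Returns 1 on success,
 * 0 if the file has no words. */
int random_word_one_pass(FILE *fp, char chosen[64])
{
    char buf[64];
    long seen = 0;

    while (fscanf(fp, "%63s", buf) == 1) {
        seen++;
        /* Keep the newest word with probability 1/seen.
         * The first word is always kept because rand() % 1 == 0. */
        if (rand() % seen == 0)
            strcpy(chosen, buf);
    }
    return seen > 0;
}
```

It uses constant memory and one read of the file, at the cost of one `rand()` call per word.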

**quzah** · 08-01-2006

I wasn't going for efficiency, I was going for lazy.

It also should be noted that the OP didn't provide any specifics with regards to bias, duplicate words, efficiency, etc.

Quzah.

**quzah** · 08-01-2006

itsme86's method:

Count each word.
Track the offset of each word.
... allocate list size.
... reallocate list size if needed.
Seek to a random word.

Wraithan's method:

Read each word.
... allocate space for each.
... add word to list.
... reallocate list size if needed.
Seek to a random spot in the list.
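
The offset-tracking list above could be sketched like this; the list-of-strings variant differs only in storing the words themselves instead of their positions. The names `index_words` and `read_word_at` are hypothetical, and the file is assumed to be opened in binary mode so that counting bytes gives valid `fseek` offsets.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: record the byte offset where each word starts.
 * Returns a malloc'd array (caller frees) and stores its length in
 * *count. Returns NULL if there are no words or realloc fails. */
long *index_words(FILE *fp, size_t *count)
{
    long *offsets = NULL, *tmp;
    size_t n = 0, cap = 0;
    long pos = 0;
    int c, in_word = 0;

    *count = 0;
    rewind(fp);
    while ((c = fgetc(fp)) != EOF) {
        if (!isspace(c) && !in_word) {       /* first byte of a word */
            if (n == cap) {
                cap = cap ? cap * 2 : 256;
                tmp = realloc(offsets, cap * sizeof *offsets);
                if (tmp == NULL) {
                    free(offsets);
                    return NULL;
                }
                offsets = tmp;
            }
            offsets[n++] = pos;
        }
        in_word = !isspace(c);
        pos++;
    }
    *count = n;
    return offsets;
}

/* Seek to the stored offset of word number i and read it into buf. */
int read_word_at(FILE *fp, const long *offsets, size_t i, char buf[64])
{
    if (fseek(fp, offsets[i], SEEK_SET) != 0)
        return 0;
    return fscanf(fp, "%63s", buf) == 1;
}
```

Each later pick is then `read_word_at(fp, offsets, rand() % count, buf)`: one seek and one short read, with only a single array of integers held in memory.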

Both of your methods are going to be reading every word anyway, so the cost of that first read cancels out. We will assume both keep track of the number of words, because both have to pick an index within the bounds of what they allocated, so that cancels out too.

The second method gains overhead as it allocates space for each word, both by increasing its memory footprint and by actually filling that space. The first doesn't incur much overhead here by comparison, because it's easier to do one large *alloc to hold a bunch of integers than it is to make multiple calls to malloc for each word. They will both incur the same overhead if they have to realloc. However, the second method still falls behind here, because it continually allocates space for each word.

Seek a word. The second recoups some of its loss here, and becomes more efficient each additional time a word is required. The first method loses each time it's required to seek in the file. Also, each time we have to copy from disk into a variable, we lose a bit there.

Really what you end up with is efficiency based on how many times you need a random word. The more you need them, the better it becomes, speed-wise, to store them all in memory. It's always a trade-off. You can have a small memory footprint, but you sacrifice speed in doing so.

[edit type=refreshing_forum_before_hitting_post]
The above is in reply to a post or two that have since been removed. Anyway, for the OP, there's the difference between the options provided by both itsme86 and Wraithan.
[/edit]

Quzah.

**lilrayray** · 08-01-2006

OK, I'll try some of these ideas out.
