fread() on a bad disk?
Originally Posted by annied: tried with fcntl(), I got past the read() but then hang on the close(fd). I'm sure ...
The only way to get around this that I can think of is to set up a timer with alarm() to send a signal to yourself after a small period, say 10 seconds. Install a handler for SIGALRM first (otherwise the default action kills the process), then call alarm() right before trying to close() the fd. If close() hangs, the signal will interrupt it and it will return -1 with errno set to EINTR. I think. I don't have a way to test it, but you could give it a shot.
Not being able to close the descriptor is a pain in the butt. It's possible that this is a crufty little corner of Linux with no clean solution possible. You may have to make some device-specific calls.
Then it's not actually blocking (since you specified O_NONBLOCK), it's a true hang. See my other comment about using a signal to force the read()/close() to terminate early. Cross your fingers and hope it works.
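For comparison, here is roughly what a non-blocking descriptor looks like when it behaves. The helper names set_nonblocking and try_read are made up for this sketch. With O_NONBLOCK set, a read() with nothing ready should come back right away with EAGAIN instead of sleeping. Pipes and sockets honor the flag; reads from regular files and disks typically ignore it.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: add O_NONBLOCK to an already-open descriptor. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper: a single read attempt.
 * Returns the byte count (0 at EOF), -2 if the read would have
 * blocked, or -1 on any other error (errno is left set by read()).
 */
ssize_t try_read(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;
    return n;
}
```

If try_read() never returns at all on a descriptor that has O_NONBLOCK set, the wait is happening somewhere the flag does not reach.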
Are you sure the driver for this device is meant to work in a hot-swap mode? Maybe the driver itself is getting confused when you pull the disk.
One last bit of information. I was able to drop into the kernel via kdb during the hang, and this is the stack trace:
RSP RIP Function (args)
0x100b2ed9b08 0xffffffff80318138 schedule+0xb6e (0x1, 0x100bf95d240)
0x100b2ed9bd8 0xffffffff80318aab io_schedule+0x26 (0x100bf95d240, 0x0, 0x1, 0x100bb5bb030, 0xffffffff8015a026)
0x100b2ed9bf8 0xffffffff8015a54c __lock_page+0xbf (0x46, 0x0, 0x40000000, 0x10, 0x1)
0x100b2ed9c98 0xffffffff8015aaa7 do_generic_mapping_read+0x1f4 (0x0, 0x100bbb3fd68, 0x100a341b8c0, 0x100b2ed9f50, 0x0)
0x100b2ed9d98 0xffffffff8015c907 __generic_file_aio_read+0x181 (0x1, 0x100a341b8c0, 0x0, 0xffffffff00000001, 0x100a341b8c0)
0x100b2ed9e18 0xffffffff8015caa2 generic_file_read+0xbb (0x100a341b8c0, 0x1000, 0xf7d47000, 0xfffffffffffffff7, 0x0)
0x100b2ed9f18 0xffffffff80179a97 vfs_read+0xcf
0x100b2ed9f48 0xffffffff80179cee sys_read+0x45
Here's the important bit of that trace: the __lock_page+0xbf frame, with io_schedule and schedule directly above it.
An attempt to lock a page of memory failed (the lock was already held elsewhere), so the scheduler is invoked to make the process wait. You'd have to figure out what other process is actually holding that lock. It may be the driver itself which is holding it. I'm beginning to suspect that the driver is not capable of properly handling the sudden removal of the device.
Anybody who can use a kernel debugger is going to get my attention. Out of curiosity, what kind of disk is this? Can you determine what driver is used? Look in /proc/ide or /proc/scsi or whatever is appropriate to the type of disk. If you can tell me the actual driver, I can poke around and see if there is some obscure ioctl() call that will help.
This is a disk which is part of an EMC CLARiiON array.
An inquiry on the disk shows (when it's not pulled out):
Vendor Identification : DGC
Product Identification : RAID 0
Revision Number : 0219
Let me know if that is the info you meant. Otherwise I can see what else I can find to be more specific. thanks!
Sounds like the right piece of info. I'll see what I can find.
...
Well, I've been perusing the source code of this driver and it's a fairly large driver. There appear to be a zillion ioctl() calls, and I can't find any clear documentation, either as comments in the source code or in the downloaded archive itself. Honestly I think this driver kind of sucks, but oh well. Try asking around on a qla2xxx-related mailing list to see if there is some ioctl() that either tells you the disconnect status of the drive, or at least lets you configure the timeout so your program doesn't hang forever.
One other thing to try. While your program is hung, kill it with a SIGUSR1 signal. See if it pops out of the hang. If it does, then the alarm() technique I mentioned earlier should at least allow you to get out of the hang.
Last edited by brewbuck; 04-29-2007 at 03:19 PM.