Thread: Comparison of Sorting algorithm doubt

  1. #16
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    In the case of external sorting (on hard drives), here is a copy of a post I made in another thread:

    As far as I know, most external sorts use k-way merge sorts (with k = 8 or more) combined with large I/Os (a large number of elements read or written per access) to reduce the random-access overhead that a higher k-way merge incurs on hard drives.
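
    To illustrate the merge step, here is a minimal k-way merge sketch in C. It is only an illustration: the "runs" are tiny in-memory arrays standing in for the large disk buffers a real external sort would refill with big sequential reads.

        /* Minimal k-way merge sketch: merges k sorted runs of ints using a
         * small min-heap keyed on each run's current element.  A real
         * external sort would back each run with a large disk buffer that
         * is refilled (with a big sequential read) as it empties. */
        #include <stdio.h>
        #include <stddef.h>

        struct run { const int *data; size_t pos, len; };

        /* heap of run indices, ordered by each run's current element */
        static void sift_down(struct run *runs, size_t *heap, size_t n, size_t i)
        {
            for (;;) {
                size_t l = 2 * i + 1, r = l + 1, m = i;
                if (l < n && runs[heap[l]].data[runs[heap[l]].pos] <
                             runs[heap[m]].data[runs[heap[m]].pos]) m = l;
                if (r < n && runs[heap[r]].data[runs[heap[r]].pos] <
                             runs[heap[m]].data[runs[heap[m]].pos]) m = r;
                if (m == i) return;
                size_t t = heap[i]; heap[i] = heap[m]; heap[m] = t;
                i = m;
            }
        }

        static void kway_merge(struct run *runs, size_t k)
        {
            size_t heap[64];                   /* sketch assumes k <= 64 */
            size_t n = 0;
            for (size_t i = 0; i < k; i++)
                if (runs[i].len) heap[n++] = i;
            for (size_t i = n; i-- > 0; )
                sift_down(runs, heap, n, i);   /* heapify */

            while (n) {
                struct run *r = &runs[heap[0]];
                printf("%d ", r->data[r->pos++]);  /* "write" smallest element */
                if (r->pos == r->len)              /* run exhausted: shrink heap */
                    heap[0] = heap[--n];
                sift_down(runs, heap, n, 0);
            }
            putchar('\n');
        }

        int main(void)
        {
            const int a[] = {1, 4, 9}, b[] = {2, 3, 8}, c[] = {5, 6, 7};
            struct run runs[] = { {a, 0, 3}, {b, 0, 3}, {c, 0, 3} };
            kway_merge(runs, 3);               /* prints 1 2 3 4 5 6 7 8 9 */
            return 0;
        }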

    I doubt anyone uses tape drives for sorting anymore, but an alternative to merge sort was the polyphase sort, a variation of merge sort that deliberately distributes data unevenly, in a special pattern, across the tape drives. Polyphase sorts could also have been used on hard drives back when memory was limited, but on modern computers there's enough memory to use large I/Os, which reduces the overhead of the increased random access in a higher k-way merge sort, and higher k-way merge sorts are faster than polyphase merge sorts (when k is 6 or greater, sometimes when k is 5).

    On some mainframes, the I/O controllers can be programmed to write elements via pointers to the elements in a single I/O, speeding up a sort (via a modified device driver or special I/O call to the device driver). This type of controller supports what is called vectored I/O, or scatter/gather. PCs have this type of controller for their hard drives, but there scatter/gather is used for the scattered 4 KB blocks of physical memory that make up what appears to be a contiguous virtual address space.


    Wiki articles:

    External sorting - Wikipedia, the free encyclopedia

    Polyphase merge sort - Wikipedia, the free encyclopedia

    Vectored I/O - Wikipedia, the free encyclopedia
    Last edited by rcgldr; 06-21-2013 at 05:41 PM.

  2. #17
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    a copy of a post I made in another thread
    O_o

    So, you linked to the same external resources, but didn't link to your own post?

    Wow. That's really weird.

    Soma

  3. #18
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by phantomotap View Post
    So, you linked to the same external resources, but didn't link to your own post?
    I wasn't sure about that, since having replies quote a linked post from another thread might have been awkward.

  4. #19
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    I wasn't sure about that, since having replies quote a linked post from another thread might have been awkward.
    O_o

    Well, yep, that is true; that would also be weird.

    Soma

  5. #20
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by rcgldr View Post
    On some mainframes, the I/O controllers can be programmed to write elements via pointers to the elements in a single I/O, speeding up a sort (via a modified device driver or special I/O call to the device driver).
    POSIX.1-2001 provides readv() and writev(). The kernel scatters/gathers the data to/from various pointers in user space. This saves one memory copy and a temporary userspace buffer: the application does not have to assemble the data into a contiguous buffer just to do large linear reads or writes to the media. If you have fixed-size sort keys, each with a pointer (and length) to the data related to the key, this is very useful. On 64-bit machines in particular, memory-mapping the data to be sorted (and sorting only an array of (key, pointer, length) items) should yield very good results.
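
    To make the gather-write idea concrete, here is a minimal C sketch, assuming the (pointer, length) index has already been sorted by key; the record contents are just placeholders, and a real sort would batch at most IOV_MAX entries per call and handle short writes.

        /* Sketch: gather-write already-sorted records with writev(), so the
         * records themselves never get copied into a contiguous output buffer. */
        #include <sys/types.h>
        #include <sys/uio.h>        /* writev, struct iovec */
        #include <stdio.h>

        struct rec { const char *ptr; size_t len; };   /* sorted index entry */

        int main(void)
        {
            /* pretend these point at records scattered in a memory-mapped
               input file, already ordered by their keys */
            struct rec sorted[] = {
                { "alpha\n", 6 }, { "bravo\n", 6 }, { "charlie\n", 8 },
            };
            enum { N = sizeof sorted / sizeof sorted[0] };

            struct iovec iov[N];
            for (int i = 0; i < N; i++) {
                iov[i].iov_base = (void *)sorted[i].ptr;
                iov[i].iov_len  = sorted[i].len;
            }

            /* one system call writes all records in sorted order */
            if (writev(1 /* stdout */, iov, N) < 0)
                perror("writev");
            return 0;
        }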

    The preadv() and pwritev() variants are supported on recent Linux and BSD systems, and allow multiple threads to do scatter reads and gather writes to the same file descriptor (because preadv()/pwritev() take an explicit file offset and neither use nor update the shared file position).
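
    A minimal sketch of that pattern: each worker thread gather-writes its own disjoint region of the shared descriptor with pwritev(), so no locking around a shared file position is needed. The file name, offsets and contents are made up for illustration; compile with -pthread.

        #define _GNU_SOURCE         /* for pwritev() on glibc */
        #include <fcntl.h>
        #include <pthread.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/uio.h>
        #include <unistd.h>

        struct job { int fd; off_t offset; const char *text; };

        static void *worker(void *arg)
        {
            struct job *j = arg;
            struct iovec iov = { .iov_base = (void *)j->text,
                                 .iov_len  = strlen(j->text) };
            /* explicit per-thread offset; the shared file position is untouched */
            if (pwritev(j->fd, &iov, 1, j->offset) < 0)
                perror("pwritev");
            return NULL;
        }

        int main(void)
        {
            int fd = open("out.bin", O_CREAT | O_WRONLY | O_TRUNC, 0644);
            if (fd < 0) { perror("open"); return 1; }

            const char *part0 = "run 0 output\n";
            const char *part1 = "run 1 output\n";
            struct job jobs[2] = {
                { fd, 0,                    part0 },   /* region 0           */
                { fd, (off_t)strlen(part0), part1 },   /* region 1, after it */
            };
            pthread_t t[2];
            for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, &jobs[i]);
            for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
            close(fd);
            return 0;
        }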

    (Memory mapping huge files is not a problem on POSIX-like systems such as Linux on 64-bit architectures; see here for an example program I wrote in 2011 that manipulates a terabyte (1099511627776-byte) file-backed structure just as if it were all in RAM. On Linux with ext4 et al. the backing file is sparse, and only uses about 250 MB of actual disk space.)
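
    A minimal sketch of the same technique at a much smaller scale, assuming a 64-bit build; the file name and size are made up for illustration.

        /* Map a large, sparse, file-backed region and use it like RAM.
         * Only pages that are actually written consume disk space on ext4 etc. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
            const off_t size = (off_t)1 << 34;         /* 16 GiB virtual window */
            int fd = open("bigdata.bin", O_CREAT | O_RDWR, 0644);
            if (fd < 0) { perror("open"); return 1; }
            if (ftruncate(fd, size) < 0) {             /* creates a sparse file */
                perror("ftruncate"); return 1;
            }

            unsigned char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }

            p[0]        = 1;                           /* touch a page at the start */
            p[size - 1] = 2;                           /* ... and one at the end    */

            /* only the touched pages ever get allocated on disk */
            munmap(p, size);
            close(fd);
            return 0;
        }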

    POSIX.1-2001 and POSIX.1-2008 also provide asynchronous I/O (AIO) via aio_read(), aio_write(), lio_listio(), and so on. AIO allows your application to scatter/gather from/to multiple file descriptors simultaneously. Instead of blocking until the operations complete, these functions only start the transfers (or tell the kernel about them), and your application is free to do other work in the meantime. Unfortunately, the GNU C library implementation uses threads internally to provide this functionality, rather than the kernel async I/O facilities.
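
    A minimal sketch of the AIO pattern, assuming an existing input file (the name is made up for illustration); link with -lrt on glibc.

        /* Start an asynchronous read with aio_read(), do other work, then
         * collect the result.  As noted above, glibc implements this with
         * helper threads rather than the kernel AIO facilities. */
        #include <aio.h>
        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
            static char buf[4096];
            int fd = open("input.bin", O_RDONLY);
            if (fd < 0) { perror("open"); return 1; }

            struct aiocb cb;
            memset(&cb, 0, sizeof cb);
            cb.aio_fildes = fd;
            cb.aio_buf    = buf;
            cb.aio_nbytes = sizeof buf;
            cb.aio_offset = 0;

            if (aio_read(&cb) < 0) { perror("aio_read"); return 1; }

            /* ... the application is free to do other work here ... */

            const struct aiocb *wait_list[1] = { &cb };
            while (aio_error(&cb) == EINPROGRESS)
                aio_suspend(wait_list, 1, NULL);   /* block until it completes */

            printf("read %zd bytes asynchronously\n", aio_return(&cb));
            close(fd);
            return 0;
        }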

    Those who have done benchmarking say that using blocking I/O and a separate thread for each file or socket descriptor (or two separate threads if simultaneously reading from and writing to the same descriptor) is still the fastest way to go.

  6. #21
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    On some mainframes, the I/O controllers can be programmed to write elements via pointers to the elements in a single I/O, speeding up a sort (via a modified device driver or special I/O call to the device driver). This type of controller supports what is called vectored I/O, or scatter/gather. PCs have this type of controller for their hard drives, but there scatter/gather is used for the scattered 4 KB blocks of physical memory that make up what appears to be a contiguous virtual address space.
    Quote Originally Posted by Nominal Animal View Post
    POSIX.1-2001 provides readv() and writev(). The kernel scatters/gathers the data to/from various pointers in user space.
    So this would take advantage of the scatter / gather hardware already present in PC controllers for hard drives (or any bus mastering controller) to deal with the virtual to physical mapping of a user's virtual address space. I wonder if Windows provides similar functions. It's possible that some controller drivers might include an IOCTL-type call to do this, but the equivalent of readv() and writev() would be nice. There was and still may be a limitation with SCSI controllers, some of which only had 17 descriptors, others that only had 255 descriptors, but IDE/SATA controllers don't seem to have any limit (apparently their descriptors can be updated during a single I/O operation via interrupt-driven code if needed).

    For merge sort, scatter/gather isn't needed for the reads, but as mentioned it speeds up the writes, since the sort or merge can be done by index or pointer without having to move the actual data.

    On a side note, if the record size is relatively small, it's probably faster to sort or merge the data directly rather than via indexes or pointers, because of the additional cache pressure from accessing the indexes or pointers in addition to comparing the data.
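
    A minimal sketch of the two approaches with toy records (no timing included; which one wins depends on record size and cache behaviour, so measure before choosing):

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        struct rec { int key; char payload[12]; };       /* small record */

        static int cmp_rec(const void *a, const void *b)
        {
            const struct rec *x = a, *y = b;
            return (x->key > y->key) - (x->key < y->key);
        }

        static int cmp_ptr(const void *a, const void *b)
        {
            const struct rec *x = *(const struct rec *const *)a;
            const struct rec *y = *(const struct rec *const *)b;
            return (x->key > y->key) - (x->key < y->key); /* extra indirection */
        }

        int main(void)
        {
            struct rec recs[] = { {3, "c"}, {1, "a"}, {2, "b"} };
            enum { N = sizeof recs / sizeof recs[0] };

            /* 1) sort the records themselves (moves whole records) */
            struct rec direct[N];
            memcpy(direct, recs, sizeof recs);
            qsort(direct, N, sizeof direct[0], cmp_rec);

            /* 2) sort pointers only (moves pointers, but every compare
                  dereferences back into the record array) */
            const struct rec *byptr[N];
            for (int i = 0; i < N; i++) byptr[i] = &recs[i];
            qsort(byptr, N, sizeof byptr[0], cmp_ptr);

            printf("direct: %d %d %d\n", direct[0].key, direct[1].key, direct[2].key);
            printf("byptr : %d %d %d\n", byptr[0]->key, byptr[1]->key, byptr[2]->key);
            return 0;
        }
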
    Last edited by rcgldr; 06-22-2013 at 11:45 AM.

  7. #22
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by rcgldr View Post
    So this would take advantage of the scatter / gather hardware already present in PC controllers for hard drives (or any bus mastering controller) to deal with the virtual to physical mapping of a user's virtual address space.
    (Background: Mass storage devices use fixed-size chunks. For SATA drives, currently either 512 bytes (traditional) or 4096 bytes (Advanced Format, on newer large drives) per chunk, AKA "sector". Without scatter/gather hardware, the drive only reads and writes full chunks to or from one contiguous buffer in RAM. With scatter/gather hardware, the drive still reads and writes full chunks, but the data for those chunks can be gathered from, or scattered to, buffers at different locations in RAM.)

    At least in Linux, normally no. Most I/O is done in page-sized chunks, via the page cache.

    It is possible some Linux direct I/O implementations and RAID drivers do scatter/gather DMA. Direct I/O bypasses the page cache. I'm not sure about hardware RAID drivers, because they're screwy, and it's been quite some time since I last looked at those parts of the Linux kernel source.

    Of course, when you memory-map files, the map aperture is page aligned, so at least Linux does direct DMA to and from the mapping. No scatter/gather hardware needed.

    Quote Originally Posted by rcgldr View Post
    There was and still may be a limitation with SCSI controllers, some of which only had 17 descriptors, others that only had 255 descriptors
    (Background: The controllers can have a number of simultaneously outstanding read or write operations; that is what rcgldr is referring to with "descriptors". The controllers reorder the operations to minimize the time they have to wait for the heads to get into position. Most drives reserve one surface just for positioning information, and their controllers know exactly where the head is positioned over the platter at any point in time. This is very important for maximizing drive performance, and is a large part of why some drives are faster than others, even if they have the same platter rotation speed and data density.)

    The fact that SATA drive caches are getting seriously large (16MB and 32MB typical on ones I use) means the queue depth does not matter that much.

    The drives can do the DMA needed for a large number of operations -- many more than the queue depth, thousands -- and do the physical reads/writes later, in whatever order they choose. From the OS's point of view, the operation has already happened once the DMA is completed.

    This works fine in normal operation. The trouble is when the power goes out unexpectedly: if there is data in the drive cache, those transfers are lost, and there is no way for the kernel to tell which ones they were.

    Barriers are one option, but they're not always implemented in a way that helps with the above scenario. Turning off the cache is another.

    I do believe the most used method currently is to wait for the drive cache to become empty (not just the command queue, but the on-drive cache itself) before assuming that completed writes have actually hit the disk. (As usual, I might be wrong, though. I often am.)

  8. #23
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Nominal Animal View Post
    Mass storage devices use fixed-size chunks. For SATA drives, currently either 512 bytes (traditional) or 4096 bytes (Advanced Format, on newer large drives) per chunk, AKA "sector".
    Since the days of the Western Digital WD1003 controller, which became the basis for PATA (IDE) and SATA, a single read or write operation can include multiple sectors. In order to use DMA or Bus Mastering controllers on systems with virtual paged memory, a call is made to the operating system, to load and lock the virtual memory buffer into physical memory and return a set of addresses and lengths (based on the physical page locations). This information is then used to program the controller's descriptors. In the case of old SCSI controllers with 17 descriptors, the maximum I/O size was 64 KB (a non-aligned 64 KB buffer can span 17 4 KB pages); with 255 descriptors, the maximum was 1 MB - 8 KB. Before ATA-6 (2002), the sector count register was 8 bits, with a value of zero meaning 256 sectors, for a maximum transfer size of 128 KB. Since ATA-6 (2002), the sector address can be 48 bits and the sector count 16 bits, for a maximum transfer size of 32 MB (with a 512-byte sector size).
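
    Just to spell out the arithmetic behind those figures (assuming 4 KB pages and 512-byte sectors; a non-aligned buffer can start and end mid-page, so one descriptor's worth of capacity is lost to the two partial pages):

        17 descriptors, non-aligned buffer:   (17 - 1)  x 4 KB            =  64 KB
        255 descriptors, non-aligned buffer:  (255 - 1) x 4 KB = 1016 KB  =  1 MB - 8 KB
        8-bit sector count (0 means 256):     256 x 512 bytes             = 128 KB
        16-bit sector count (ATA-6):          65536 x 512 bytes           =  32 MB
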
    Last edited by rcgldr; 06-22-2013 at 06:32 PM.

  9. #24
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by rcgldr View Post
    a single read or write operation can include multiple sectors
    Absolutely; I think I phrased it wrong. (Shouldn't type this late at night.) I did mean sector granular, i.e. one or more complete consecutive sectors in each operation, if there is no hardware support for scatter/gather.

    Quote Originally Posted by rcgldr View Post
    In order to use DMA or Bus Mastering controllers on systems with virtual paged memory, a call is made to the operating system, to load and lock the virtual memory buffer into physical memory and return a set of addresses and lengths.
    In Linux, the kernel does all that internally.

    If the architecture has an IOMMU, the kernel can use the IOMMU to provide the device with linear addresses, so that multiple consecutive pages (from consecutive sectors on disk) can be read or written in one operation, even if the pages are not consecutive in physical memory.


    We're getting into pretty esoteric stuff here. I'm definitely not an expert on this, either. I only write code that uses these kernel services (and sometimes troubleshoot kernel bugs related to them), and try to make it go fast while still being reliable.

  10. #25
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    In order to use DMA or Bus Mastering controllers on systems with virtual paged memory, a call is made to the operating system, to load and lock the virtual memory buffer into physical memory and return a set of addresses and lengths (based on physical page locations).
    Quote Originally Posted by Nominal Animal View Post
    In Linux, the kernel does all that internally ... IOMMU .
    For Windows, device drivers use this function.

    MmProbeAndLockPages routine (Windows Drivers)

    Back in the days of Windows NT, a similar function was named VirtToPhys().

    It's been a while since I messed with Windows device drivers, so I don't know what's been converted to use IOMMU.

  11. #26
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by rcgldr View Post
    For Windows, device drivers use this function.
    Right. In Linux, there is no hard boundary between device drivers and the kernel itself; there is no stable in-kernel API or ABI for drivers.

    If you are interested in this stuff, you might find Linux Device Drivers, Third Edition interesting; the latter part of Chapter 15, Memory Mapping and DMA, is relevant. The section starting at page 450 of the book (page 39 in the PDF numbering) describes how Linux drivers set up streaming scatter/gather DMA, if they know the device supports it.

    Since it's such a common task (across many different types of drives), there's a common interface for it, and individual drivers do not need to implement it themselves from scratch.

    I've probably given a poor picture of how things work in Linux land in my posts; the LDD3 book is definitely a better source.
