From: Nick Piggin This brings biodoc.txt a bit more up to date with recent elevator changes. Documentation/block/biodoc.txt | 228 ++++++++++++++++++++++------------------- 1 files changed, 126 insertions(+), 102 deletions(-) diff -puN Documentation/block/biodoc.txt~elv-doc Documentation/block/biodoc.txt --- 25/Documentation/block/biodoc.txt~elv-doc 2003-09-18 08:32:34.000000000 -0700 +++ 25-akpm/Documentation/block/biodoc.txt 2003-09-18 08:32:34.000000000 -0700 @@ -6,6 +6,8 @@ Notes Written on Jan 15, 2002: Suparna Bhattacharya Last Updated May 2, 2002 +September 2003: Updated I/O Scheduler portions + Nick Piggin Introduction: @@ -220,42 +222,8 @@ i/o scheduling algorithm aspects and det It also makes it possible to completely hide the implementation details of the i/o scheduler from block drivers. -New routines to be used instead of accessing the queue directly: - -elv_add_request: Should be called to queue a request -elv_next_request: Should be called to pull of the next request to be serviced - of the queue. It takes care of several things like skipping active requests, - invoking the command pre-builder etc. - -Some new plugins: -e->elevator_next_req_fn - Plugin called to extract the next request to service from the - queue -e->elevator_add_req_fn - Plugin called to add a new request to the queue -e->elevator_init_fn - Plugin called when initializing the queue -e->elevator_exit_fn - Plugin called when destrying the queue - -Elevator Linus and Elevator noop are the existing combinations that can be -directly used, but a driver can provide relevant callbacks, in case -it needs to do something different. - -Elevator noop only attempts to merge requests, but doesn't reorder (sort) -them. Even merging requires a linear scan today (except for the last merged -hint case discussed later) though, which takes take up some CPU cycles. 
- -[Note: Merging usually helps in general, because there's usually non-trivial -command overhead associated with setting up and starting a command. Sorting, -on the other hand, may not be relevant for intelligent devices that reorder -requests anyway] - -Elevator Linus attempts merging as well as sorting of requests on the queue. -The sorting happens via an insert scan whenever a request comes in. -Often some sorting still makes sense as the depth which most hardware can -handle may be less than the queue lengths during i/o loads. - +I/O scheduler wrappers are to be used instead of accessing the queue directly. +See section 4. The I/O scheduler for details. 1.2 Tuning Based on High level code capabilities @@ -317,32 +285,6 @@ Arjan's proposed request priority scheme requests. Some bits in the bi_rw flags field in the bio structure are intended to be used for this priority information. - Jens has an implementation of a simple deadline i/o scheduler that - makes a best effort attempt to start requests within a given expiry - time limit, along with trying to optimize disk seeks as in the current - elevator. It does this by sorting a request on two lists, one by - the deadline and one by the sector order. It employs a policy that - follows sector ordering as long as a deadline is not violated, and - tries to keep up with deadlines in so far as it can batch up to at - least a certain minimum number of sector ordered requests to reduce - arbitrary disk seeks. This implementation is constructed in a way - that makes it possible to support advanced compound i/o schedulers - as a combination of several low level schedulers with an overall - class-independent scheduler layered above. - -The current elevator scheme provides a latency bound over how many future -requests can "pass" (get placed before) a given request, and this bound -is determined by the request type (read, write). 
However, it doesn't -prioritize a new request over existing requests in the queue based on its -latency requirement. A new request could of course get serviced before -earlier requests based on the position on disk which it accesses. This is -due to the sort/merge in the basic elevator scan logic, but irrespective -of the request's own priority/latency value. Interestingly the elevator -sequence or the latency bound setting of the new request is unaffected by the -number of existing requests it has passed, i.e. doesn't depend on where -it is positioned in the queue, but only on the number of requests that pass -it in the future. - 1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode) (e.g Diagnostics, Systems Management) @@ -964,7 +906,74 @@ Aside: 4. The I/O scheduler +I/O schedulers are now per queue. They should be runtime switchable and modular +but aren't yet. Jens has most bits to do this, but the sysfs implementation is +missing. + +A block layer call to the i/o scheduler follows the convention elv_xxx(). This +calls elevator_xxx_fn in the elevator switch (drivers/block/elevator.c). Oh, +xxx and xxx might not match exactly, but use your imagination. If an elevator +doesn't implement a function, the switch does nothing or some minimal house +keeping work. + +4.1. I/O scheduler API + +The functions an elevator may implement are: (* are mandatory) +elevator_merge_fn called to query requests for merge with a bio + +elevator_merge_req_fn " " " with another request + +elevator_merged_fn called when a request in the scheduler has been + involved in a merge. It is used in the deadline + scheduler for example, to reposition the request + if its sorting order has changed. + +*elevator_next_req_fn returns the next scheduled request, or NULL + if there are none (or none are ready). + +*elevator_add_req_fn called to add a new request into the scheduler + +elevator_queue_empty_fn returns true if the merge queue is empty. 
+ Drivers shouldn't use this, but rather check + if elv_next_request is NULL (without losing the + request if one exists!) + +elevator_remove_req_fn This is called when a driver claims ownership of + the target request - it now belongs to the + driver. It must not be modified or merged. + Drivers must not lose the request! A subsequent + call of elevator_next_req_fn must return the + _next_ request. + +elevator_requeue_req_fn called to add a request to the scheduler. This + is used when the request has already been + returned by elv_next_request, but hasn't + completed. If this is not implemented then + elevator_add_req_fn is called instead. + +elevator_former_req_fn +elevator_latter_req_fn These return the request before or after the + one specified in disk sort order. Used by the + block layer to find merge possibilities. + +elevator_completed_req_fn called when a request is completed. This might + come about due to being merged with another or + when the device completes the request. + +elevator_may_queue_fn returns true if the scheduler wants to allow the + current context to queue a new request even if + it is over the queue limit. This must be used + very carefully!! + +elevator_set_req_fn +elevator_put_req_fn Must be used to allocate and free any elevator + specific storage for a request. + +elevator_init_fn +elevator_exit_fn Allocate and free any elevator specific storage + for a queue. +4.2 I/O scheduler implementation The generic i/o scheduler algorithm attempts to sort/merge/batch requests for optimal disk scan and request servicing performance (based on generic principles and device capabilities), optimized for: @@ -974,49 +983,58 @@ iii. better utilization of h/w & CPU tim Characteristics: -i. Linked list for O(n) insert/merge (linear scan) right now - -This is just the same as it was in 2.4.
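To make the elevator switch convention in section 4.1 concrete, here is a
minimal user-space sketch. The struct layout, names and fallback behaviour are
illustrative assumptions only, not the real drivers/block/elevator.c code; it
shows the general pattern of dispatching to an optional hook and doing minimal
default work when the elevator leaves it unimplemented (NULL).

```c
#include <assert.h>
#include <stddef.h>

struct request { int sector; };

/* Illustrative elevator switch: a table of optional hooks.
 * NULL means "not implemented; the switch supplies a default". */
struct elevator_ops {
    struct request *(*elevator_next_req_fn)(void);  /* mandatory */
    void (*elevator_add_req_fn)(struct request *);  /* mandatory */
    int (*elevator_queue_empty_fn)(void);           /* optional  */
};

static struct request pending = { 42 };
static int have_pending = 1;

static struct request *noop_next(void)
{
    return have_pending ? &pending : NULL;
}

static void noop_add(struct request *rq)
{
    (void)rq;
    have_pending = 1;
}

static struct elevator_ops noop_ops = {
    .elevator_next_req_fn = noop_next,
    .elevator_add_req_fn  = noop_add,
    /* elevator_queue_empty_fn left NULL on purpose */
};

/* elv_queue_empty(): call the hook if the elevator implements it,
 * otherwise fall back to checking whether the next request is NULL. */
int elv_queue_empty(struct elevator_ops *e)
{
    if (e->elevator_queue_empty_fn)
        return e->elevator_queue_empty_fn();
    return e->elevator_next_req_fn() == NULL;
}
```

The same NULL-check-then-default shape applies to the other optional hooks,
e.g. falling back to elevator_add_req_fn when elevator_requeue_req_fn is
missing.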
- -There is however an added level of abstraction in the operations for adding -and extracting a request to/from the queue, which makes it possible to -try out alternative queue structures without changes to the users of the queue. -Some things like head-active are thus now handled within elv_next_request -making it possible to mark more than one request to be left alone. - -Aside: -1. The use of a merge hash was explored to reduce merge times and to make - elevator_noop close to noop by avoiding the scan for merge. However the - complexity and locking issues introduced wasn't desirable especially as - with multi-page bios the incidence of merges is expected to be lower. -2. The use of binomial/fibonacci heaps was explored to reduce the scan time; - however the idea was given up due to the complexity and added weight of - data structures, complications for handling barriers, as well as the - advantage of O(1) extraction and deletion (performance critical path) with - the existing list implementation vs heap based implementations. - -ii. Utilizes max_phys/hw_segments, and max_request_size parameters, to merge - within the limits that the device can handle (See 3.2.2) - -iii. Last merge hint - -In 2.5, information about the last merge is saved as a hint for the subsequent -request. This way, if sequential data is coming down the pipe, the hint can -be used to speed up merges without going through a scan. +i. Binary tree +AS and deadline i/o schedulers use red-black binary trees for disk position +sorting and searching, and a fifo linked list for time-based searching. This +gives good scalability and good availability of information. Requests are +almost always dispatched in disk sort order, so a cache is kept of the next +request in sort order to prevent binary tree lookups. + +This arrangement is not a generic block layer characteristic however, so +elevators may implement queues as they please. + +ii.
Last merge hint +The last merge hint is part of the generic queue layer. I/O schedulers must do +some management on it. The most important thing is to make +sure q->last_merge is cleared (set to NULL) when the request on it is no longer +a candidate for merging (for example if it has been sent to the driver). + +The last merge performed is cached as a hint for the subsequent request. If +sequential data is being submitted, the hint is used to perform merges without +any scanning. This is not sufficient when there are multiple processes doing +I/O though, so a "merge hash" is used by some schedulers. + +iii. Merge hash +AS and deadline use a hash table indexed by the last sector of a request. This +enables merging code to quickly look up "back merge" candidates, even when +multiple I/O streams are being performed at once on one disk. + +"Front merges", a new request being merged at the front of an existing request, +are far less common than "back merges" due to the nature of most I/O patterns. +Front merges are handled by the binary trees in AS and deadline schedulers. iv. Handling barrier cases +A request with flags REQ_HARDBARRIER or REQ_SOFTBARRIER must not be ordered +around. That is, it must be processed after all older requests, and before +any newer ones. This includes merges! + +In AS and deadline schedulers, barriers have the effect of flushing the reorder +queue. The performance cost of this will vary from nothing to a lot depending +on i/o patterns and device characteristics. Obviously they won't improve +performance, so their use should be kept to a minimum. + +v. Handling insertion position directives +A request may be inserted with a position directive. The directives are one of +ELEVATOR_INSERT_BACK, ELEVATOR_INSERT_FRONT, ELEVATOR_INSERT_SORT. + +ELEVATOR_INSERT_SORT is a general directive for non-barrier requests. +ELEVATOR_INSERT_BACK is used to insert a barrier to the back of the queue.
+ELEVATOR_INSERT_FRONT is used to insert a barrier to the front of the queue, and +overrides the ordering requested by any previous barriers. In practice this is +harmless and required, because it is used for SCSI requeueing. This does not +require flushing the reorder queue, so does not impose a performance penalty. -As mentioned earlier, barrier support is new to 2.5, and the i/o scheduler -has been modified accordingly. - -When a barrier comes in, then since insert happens in the form of a -linear scan, starting from the end, it just needs to be ensured that this -and future scans stops barrier point. This is achieved by skipping the -entire merge/scan logic for a barrier request, so it gets placed at the -end of the queue, and specifying a zero latency for the request containing -the bio so that no future requests can pass it. - -v. Plugging the queue to batch requests in anticipation of opportunities for +vi. Plugging the queue to batch requests in anticipation of opportunities for merge/sort optimizations This is just the same as in 2.4 so far, though per-device unplugging @@ -1051,6 +1069,12 @@ Aside: blk_kick_queue() to unplug a specific queue (right away ?) or optionally, all queues, is in the plan. +4.3 I/O contexts +I/O contexts provide a dynamically allocated per-process data area. They may +be used in I/O schedulers, and in the block layer (could be used for IO stats, +priorities for example). See *io_context in drivers/block/ll_rw_blk.c, and +as-iosched.c for an example of usage in an i/o scheduler. + 5. Scalability related changes _
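As a rough illustration of the I/O context idea in section 4.3, the following
user-space sketch lazily allocates a per-process data area on first use. It is
a toy model under stated assumptions: the names, the fixed-size table and the
lack of locking, refcounting and collision handling are all simplifications;
the real *io_context management in drivers/block/ll_rw_blk.c differs in detail.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-process I/O context, e.g. for per-process
 * statistics or priorities (names are illustrative only). */
struct io_context {
    int pid;
    unsigned long nr_requests;  /* toy per-process I/O statistic */
};

#define MAX_PROCS 64
static struct io_context *table[MAX_PROCS];

/* Look up the context for a process, allocating it on first use.
 * Collisions between pids are ignored in this toy model. */
struct io_context *get_io_context(int pid)
{
    struct io_context **slot = &table[pid % MAX_PROCS];

    if (!*slot) {
        *slot = calloc(1, sizeof(**slot));
        if (*slot)
            (*slot)->pid = pid;
    }
    return *slot;
}
```

The key property shown is that repeated lookups for the same process return
the same dynamically allocated area, so a scheduler such as as-iosched.c can
accumulate per-process state across requests.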