From: Nick Piggin

Now that we are correctly kicking off kswapd early (before the synch reclaim
watermark), it is really doing asynchronous pageout.  This has exposed a
latent problem where allocators running at the same time will make kswapd
think it is getting into trouble, and cause too much swapping and suboptimal
behaviour.

This patch changes the kswapd scanning algorithm to use the same metrics for
measuring pageout success as the synchronous reclaim path - namely, how much
work is required to free SWAP_CLUSTER_MAX pages.

This should make things less fragile all round, and has the added benefit
that kswapd will continue running so long as memory is low and it is managing
to free pages, rather than going through the full priority loop, then giving
up.  Should result in much better behaviour all round, especially when there
are concurrent allocators.

XXX before merging:

* Ram Pai has indicated we're now doing better than 2.6.9-rc1 on 'DSS'.
  Need to have a look at results.

* Need to check that it solves Ray Bryant's swapping troubles.

* I need to run it through one or two swap regression tests.

Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
---

 25-akpm/mm/vmscan.c |   14 +++++++++++++-
 1 files changed, 13 insertions(+), 1 deletion(-)

diff -puN mm/vmscan.c~vm-no-wild-kswapd mm/vmscan.c
--- 25/mm/vmscan.c~vm-no-wild-kswapd	2004-09-25 22:03:56.574259208 -0700
+++ 25-akpm/mm/vmscan.c	2004-09-25 22:03:56.580258296 -0700
@@ -993,10 +993,13 @@ static int balance_pgdat(pg_data_t *pgda
 	int to_free = nr_pages;
 	int priority;
 	int i;
-	int total_scanned = 0, total_reclaimed = 0;
+	int total_scanned, total_reclaimed;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc;
 
+loop_again:
+	total_scanned = 0;
+	total_reclaimed = 0;
 	sc.gfp_mask = GFP_KERNEL;
 	sc.may_writepage = 0;
 	sc.nr_mapped = read_page_state(nr_mapped);
@@ -1095,6 +1098,15 @@ scan:
 		 */
 		if (total_scanned && priority < DEF_PRIORITY - 2)
 			blk_congestion_wait(WRITE, HZ/10);
+
+		/*
+		 * We do this so kswapd doesn't build up large priorities for
+		 * example when it is freeing in parallel with allocators. It
+		 * matches the direct reclaim path behaviour in terms of impact
+		 * on zone->*_priority.
+		 */
+		if (total_reclaimed >= SWAP_CLUSTER_MAX)
+			goto loop_again;
 	}
 out:
 	for (i = 0; i < pgdat->nr_zones; i++) {
_
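
The effect of the loop_again restart can be illustrated with a small
userspace toy model.  This is only a sketch under stated assumptions, not the
kernel's code: simulate_balance(), struct sim_result, the fixed reclaim
percentage and the single "target" exit condition are all hypothetical
simplifications (the real balance_pgdat() walks zones and checks watermarks).
DEF_PRIORITY = 12 and SWAP_CLUSTER_MAX = 32 are assumed to match the values
in kernels of this era.

```c
#include <assert.h>

/* Assumed to match 2.6-era kernel values; this is a toy model only. */
#define DEF_PRIORITY     12
#define SWAP_CLUSTER_MAX 32

/* Hypothetical result record, not a kernel structure. */
struct sim_result {
	int  lowest_priority;	/* numerically smallest priority reached */
	int  passes;		/* times the priority loop was (re)entered */
	long total_freed;	/* pages freed over the whole run */
};

/*
 * Hypothetical model of the priority loop.  Each step scans
 * lru_pages >> priority pages and frees reclaim_pct percent of them.
 * The run stops once "target" pages are freed or priority 0 is done.
 * When "restart" is non-zero, freeing SWAP_CLUSTER_MAX pages within a
 * pass jumps back to DEF_PRIORITY, as the patch does.
 */
struct sim_result simulate_balance(long lru_pages, int reclaim_pct,
				   long target, int restart)
{
	struct sim_result r = { DEF_PRIORITY, 0, 0 };
	long total_reclaimed;
	int priority;

	assert(reclaim_pct >= 0 && reclaim_pct <= 100);

loop_again:
	total_reclaimed = 0;
	r.passes++;

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		long scanned = lru_pages >> priority;
		long reclaimed = scanned * reclaim_pct / 100;

		if (priority < r.lowest_priority)
			r.lowest_priority = priority;

		lru_pages -= reclaimed;
		total_reclaimed += reclaimed;
		r.total_freed += reclaimed;

		if (r.total_freed >= target)
			return r;

		/* Enough progress in this pass: start again at low pressure. */
		if (restart && total_reclaimed >= SWAP_CLUSTER_MAX)
			goto loop_again;
	}
	return r;
}
```

In this model, simulate_balance(100000, 50, 1000, 0) walks down to priority 6
in a single pass before it has freed 1000 pages, while
simulate_balance(100000, 50, 1000, 1) reaches the same goal without ever
going below priority 11, at the cost of re-entering the loop many times.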