From: David Howells Here's a patch that I worked out with Al Viro that adds support for a filesystem (such as kAFS) to perform automounting intrinsically without the need for a userspace daemon. It also adds support for such mountpoints to be degraded at the filesystem's behest until they've been untouched long enough that they'll be removed. I've a patch (to follow) that removes some #ifdef's from fs/afs/* thus allowing it to make use of this facility. There are five pieces to this: (1) Any interested filesystem needs to have at least one list to which expirable mountpoints can be added. Access to this list is governed by the vfsmount_lock. (2) When a filesystem wants to create an expirable mount, it calls do_kern_mount() to get a handle on the filesystem it wants mounting, and then calls do_add_mount() to mount that filesystem on the designated mountpoint, supplying the list mentioned in (1) to which the vfsmount will be added. In kAFS's case, the mountpoint is a directory with a follow_link() method defined (fs/afs/mntpt.c). This uses the struct nameidata supplied as an argument as a determination of where the new filesystem should be mounted. (3) When something using a vfsmount finishes dealing with it, it calls mntput(). This unmarks the vfsmount for immediate expiry. There are two criteria for determining if a vfsmount may be expired - it mustn't be marked as in use for anything other than being a child of another vfsmount, and it must have an expiry mark against it already. (4) The filesystem then determines the policy on expiring the mounts created in (2). When it feels the need to, it passes the list mentioned in (1) to mark_mounts_for_expiry() to request everything on the list be expired. This function examines each mount listed. If the vfsmount meets the criteria mentioned in (3), then the vfsmount is deleted from the namespace and disposed of as for unmounting; otherwise the vfsmount is left untouched apart from now bearing an expiration mark if it didn't before. kAFS's expiration policy is simply to invoke this process at regular intervals for all the mounts on its list. (5) An expiration facility is also provided to userspace: by calling umount() with a MNT_EXPIRE flag, it can make a request to unmount only if the mountpoint hasn't been used since the last request and isn't in use now. This allows expiration to be driven by userspace instead of by the kernel if that is desirable. This also means that do_umount() has to use a different version of path_release() to everyone else... it can't call mntput() as that clears the expiration flag, thus rendering this unachievable; so it's version of path_release() calls _mntput(), which doesn't do the clear. My original idea was to give the kernel more knowledge of automounted things. This avoids a certain problem with stat() on a mountpoint causing it to mount (for example, do "ls -l /afs" on a machine with kAFS), but Al wanted it done this way. > Why is autofs unsuitable? Because: (1) Autofs is flat; AFS requires a tree - mounts on mounts on mounts on mounts... (2) AFS holds the data as to what the mountpoints are and where they go, and these may be cross-links to subtrees beyond your control. It's also not trivial to extract a list of mountpoints as is required for autofs. (3) Autofs is not namespace safe. (4) Ducking back to userspace to get that to do the mount is pretty tricky if namespaces are involved. In fact, autofs may well want to make use of this facility. Signed-Off-By: David Howells Signed-off-by: Andrew Morton --- 25-akpm/Documentation/filesystems/automount-support.txt | 118 ++++++++ 25-akpm/fs/namei.c | 10 25-akpm/fs/namespace.c | 215 ++++++++++++++-- 25-akpm/fs/super.c | 3 25-akpm/include/linux/fs.h | 1 25-akpm/include/linux/mount.h | 21 + 25-akpm/include/linux/namei.h | 1 25-akpm/include/linux/namespace.h | 2 8 files changed, 350 insertions(+), 21 deletions(-) diff -puN /dev/null Documentation/filesystems/automount-support.txt --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/Documentation/filesystems/automount-support.txt Thu Jul 1 14:50:54 2004 @@ -0,0 +1,118 @@ +Support is available for filesystems that wish to do automounting support (such +as kAFS which can be found in fs/afs/). This facility includes allowing +in-kernel mounts to be performed and mountpoint degradation to be +requested. The latter can also be requested by userspace. + + +====================== +IN-KERNEL AUTOMOUNTING +====================== + +A filesystem can now mount another filesystem on one of its directories by the +following procedure: + + (1) Give the directory a follow_link() operation. + + When the directory is accessed, the follow_link op will be called, and + it will be provided with the location of the mountpoint in the nameidata + structure (vfsmount and dentry). + + (2) Have the follow_link() op do the following steps: + + (a) Call do_kern_mount() to call the appropriate filesystem to set up a + superblock and gain a vfsmount structure representing it. + + (b) Copy the nameidata provided as an argument and substitute the dentry + argument into it the copy. + + (c) Call do_add_mount() to install the new vfsmount into the namespace's + mountpoint tree, thus making it accessible to userspace. Use the + nameidata set up in (b) as the destination. + + If the mountpoint will be automatically expired, then do_add_mount() + should also be given the location of an expiration list (see further + down). + + (d) Release the path in the nameidata argument and substitute in the new + vfsmount and its root dentry. The ref counts on these will need + incrementing. + +Then from userspace, you can just do something like: + + [root@andromeda root]# mount -t afs \#root.afs. /afs + [root@andromeda root]# ls /afs + asd cambridge cambridge.redhat.com grand.central.org + [root@andromeda root]# ls /afs/cambridge + afsdoc + [root@andromeda root]# ls /afs/cambridge/afsdoc/ + ChangeLog html LICENSE pdf RELNOTES-1.2.2 + +And then if you look in the mountpoint catalogue, you'll see something like: + + [root@andromeda root]# cat /proc/mounts + ... + #root.afs. /afs afs rw 0 0 + #root.cell. /afs/cambridge.redhat.com afs rw 0 0 + #afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0 + + +=========================== +AUTOMATIC MOUNTPOINT EXPIRY +=========================== + +Automatic expiration of mountpoints is easy, provided you've mounted the +mountpoint to be expired in the automounting procedure outlined above. + +To do expiration, you need to follow these steps: + + (3) Create at least one list off which the vfsmounts to be expired can be + hung. Access to this list will be governed by the vfsmount_lock. + + (4) In step (2c) above, the call to do_add_mount() should be provided with a + pointer to this list. It will hang the vfsmount off of it if it succeeds. + + (5) When you want mountpoints to be expired, call mark_mounts_for_expiry() + with a pointer to this list. This will process the list, marking every + vfsmount thereon for potential expiry on the next call. + + If a vfsmount was already flagged for expiry, and if its usage count is 1 + (it's only referenced by its parent vfsmount), then it will be deleted + from the namespace and thrown away (effectively unmounted). + + It may prove simplest to simply call this at regular intervals, using + some sort of timed event to drive it. + +The expiration flag is cleared by calls to mntput. This means that expiration +will only happen on the second expiration request after the last time the +mountpoint was accessed. + +If a mountpoint is moved, it gets removed from the expiration list. If a bind +mount is made on an expirable mount, the new vfsmount will not be on the +expiration list and will not expire. + +If a namespace is copied, all mountpoints contained therein will be copied, +and the copies of those that are on an expiration list will be added to the +same expiration list. + + +======================= +USERSPACE DRIVEN EXPIRY +======================= + +As an alternative, it is possible for userspace to request expiry of any +mountpoint (though some will be rejected - the current process's idea of the +rootfs for example). It does this by passing the MNT_EXPIRE flag to +umount(). This flag is considered incompatible with MNT_FORCE and MNT_DETACH. + +If the mountpoint in question is in referenced by something other than +umount() or its parent mountpoint, an EBUSY error will be returned and the +mountpoint will not be marked for expiration or unmounted. + +If the mountpoint was not already marked for expiry at that time, an EAGAIN +error will be given and it won't be unmounted. + +Otherwise if it was already marked and it wasn't referenced, unmounting will +take place as usual. + +Again, the expiration flag is cleared every time anything other than umount() +looks at a mountpoint. diff -puN fs/namei.c~intrinsic-automount-and-mountpoint-degradation-support fs/namei.c --- 25/fs/namei.c~intrinsic-automount-and-mountpoint-degradation-support Thu Jul 1 14:50:54 2004 +++ 25-akpm/fs/namei.c Thu Jul 1 14:50:54 2004 @@ -279,6 +279,16 @@ void path_release(struct nameidata *nd) } /* + * umount() mustn't call path_release()/mntput() as that would clear + * mnt_expiry_mark + */ +void path_release_on_umount(struct nameidata *nd) +{ + dput(nd->dentry); + _mntput(nd->mnt); +} + +/* * Internal lookup() using the new generic dcache. * SMP-safe */ diff -puN fs/namespace.c~intrinsic-automount-and-mountpoint-degradation-support fs/namespace.c --- 25/fs/namespace.c~intrinsic-automount-and-mountpoint-degradation-support Thu Jul 1 14:50:54 2004 +++ 25-akpm/fs/namespace.c Thu Jul 1 14:50:54 2004 @@ -60,6 +60,7 @@ struct vfsmount *alloc_vfsmnt(const char INIT_LIST_HEAD(&mnt->mnt_child); INIT_LIST_HEAD(&mnt->mnt_mounts); INIT_LIST_HEAD(&mnt->mnt_list); + INIT_LIST_HEAD(&mnt->mnt_fslink); if (name) { int size = strlen(name)+1; char *newname = kmalloc(size, GFP_KERNEL); @@ -106,13 +107,9 @@ struct vfsmount *lookup_mnt(struct vfsmo EXPORT_SYMBOL(lookup_mnt); -static int check_mnt(struct vfsmount *mnt) +static inline int check_mnt(struct vfsmount *mnt) { - spin_lock(&vfsmount_lock); - while (mnt->mnt_parent != mnt) - mnt = mnt->mnt_parent; - spin_unlock(&vfsmount_lock); - return mnt == current->namespace->root; + return mnt->mnt_namespace == current->namespace; } static void detach_mnt(struct vfsmount *mnt, struct nameidata *old_nd) @@ -164,6 +161,14 @@ clone_mnt(struct vfsmount *old, struct d mnt->mnt_root = dget(root); mnt->mnt_mountpoint = mnt->mnt_root; mnt->mnt_parent = mnt; + mnt->mnt_namespace = old->mnt_namespace; + + /* stick the duplicate mount on the same expiry list + * as the original if that was on one */ + spin_lock(&vfsmount_lock); + if (!list_empty(&old->mnt_fslink)) + list_add(&mnt->mnt_fslink, &old->mnt_fslink); + spin_unlock(&vfsmount_lock); } return mnt; } @@ -346,6 +351,7 @@ void umount_tree(struct vfsmount *mnt) while (!list_empty(&kill)) { mnt = list_entry(kill.next, struct vfsmount, mnt_list); list_del_init(&mnt->mnt_list); + list_del_init(&mnt->mnt_fslink); if (mnt->mnt_parent == mnt) { spin_unlock(&vfsmount_lock); } else { @@ -369,6 +375,24 @@ static int do_umount(struct vfsmount *mn return retval; /* + * Allow userspace to request a mountpoint be expired rather than + * unmounting unconditionally. Unmount only happens if: + * (1) the mark is already set (the mark is cleared by mntput()) + * (2) the usage count == 1 [parent vfsmount] + 1 [sys_umount] + */ + if (flags & MNT_EXPIRE) { + if (mnt == current->fs->rootmnt || + flags & (MNT_FORCE | MNT_DETACH)) + return -EINVAL; + + if (atomic_read(&mnt->mnt_count) != 2) + return -EBUSY; + + if (!xchg(&mnt->mnt_expiry_mark, 1)) + return -EAGAIN; + } + + /* * If we may have to abort operations to get out of this * mount, and they will themselves hold resources we must * allow the fs to do things. In the Unix tradition of @@ -461,7 +485,7 @@ asmlinkage long sys_umount(char __user * retval = do_umount(nd.mnt, flags); dput_and_out: - path_release(&nd); + path_release_on_umount(&nd); out: return retval; } @@ -618,6 +642,11 @@ static int do_loopback(struct nameidata } if (mnt) { + /* stop bind mounts from expiring */ + spin_lock(&vfsmount_lock); + list_del_init(&mnt->mnt_fslink); + spin_unlock(&vfsmount_lock); + err = graft_tree(mnt, nd); if (err) { spin_lock(&vfsmount_lock); @@ -638,7 +667,8 @@ static int do_loopback(struct nameidata * on it - tough luck. */ -static int do_remount(struct nameidata *nd,int flags,int mnt_flags,void *data) +static int do_remount(struct nameidata *nd, int flags, int mnt_flags, + void *data) { int err; struct super_block * sb = nd->mnt->mnt_sb; @@ -710,6 +740,10 @@ static int do_move_mount(struct nameidat detach_mnt(old_nd.mnt, &parent_nd); attach_mnt(old_nd.mnt, nd); + + /* if the mount is moved, it should no longer be expire + * automatically */ + list_del_init(&old_nd.mnt->mnt_fslink); out2: spin_unlock(&vfsmount_lock); out1: @@ -722,11 +756,14 @@ out: return err; } -static int do_add_mount(struct nameidata *nd, char *type, int flags, +/* + * create a new mount for userspace and request it to be added into the + * namespace's tree + */ +static int do_new_mount(struct nameidata *nd, char *type, int flags, int mnt_flags, char *name, void *data) { struct vfsmount *mnt; - int err; if (!type || !memchr(type, 0, PAGE_SIZE)) return -EINVAL; @@ -736,9 +773,20 @@ static int do_add_mount(struct nameidata return -EPERM; mnt = do_kern_mount(type, flags, name, data); - err = PTR_ERR(mnt); if (IS_ERR(mnt)) - goto out; + return PTR_ERR(mnt); + + return do_add_mount(mnt, nd, mnt_flags, NULL); +} + +/* + * add a mount into a namespace's mount tree + * - provide the option of adding the new mount to an expiration list + */ +int do_add_mount(struct vfsmount *newmnt, struct nameidata *nd, + int mnt_flags, struct list_head *fslist) +{ + int err; down_write(¤t->namespace->sem); /* Something was mounted here while we slept */ @@ -750,22 +798,143 @@ static int do_add_mount(struct nameidata /* Refuse the same filesystem on the same mount point */ err = -EBUSY; - if (nd->mnt->mnt_sb == mnt->mnt_sb && nd->mnt->mnt_root == nd->dentry) + if (nd->mnt->mnt_sb == newmnt->mnt_sb && + nd->mnt->mnt_root == nd->dentry) goto unlock; err = -EINVAL; - if (S_ISLNK(mnt->mnt_root->d_inode->i_mode)) + if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode)) goto unlock; - mnt->mnt_flags = mnt_flags; - err = graft_tree(mnt, nd); + newmnt->mnt_flags = mnt_flags; + err = graft_tree(newmnt, nd); + + if (err == 0 && fslist) { + /* add to the specified expiration list */ + spin_lock(&vfsmount_lock); + list_add_tail(&newmnt->mnt_fslink, fslist); + spin_unlock(&vfsmount_lock); + } + unlock: up_write(¤t->namespace->sem); - mntput(mnt); -out: + mntput(newmnt); return err; } +EXPORT_SYMBOL_GPL(do_add_mount); + +/* + * process a list of expirable mountpoints with the intent of discarding any + * mountpoints that aren't in use and haven't been touched since last we came + * here + */ +void mark_mounts_for_expiry(struct list_head *mounts) +{ + struct namespace *namespace; + struct list_head graveyard, *_p, *_n; + struct vfsmount *mnt; + + if (list_empty(&mounts)) + return; + + INIT_LIST_HEAD(&graveyard); + + spin_lock(&vfsmount_lock); + + /* extract from the expiration list every vfsmount that matches the + * following criteria: + * - only referenced by its parent vfsmount + * - still marked for expiry (marked on the last call here; marks are + * cleared by mntput()) + */ + list_for_each_safe(_p, _n, mounts) { + mnt = list_entry(_p, struct vfsmount, mnt_fslink); + + if (!xchg(&mnt->mnt_expiry_mark, 1) || + atomic_read(&mnt->mnt_count) != 1) + continue; + + mntget(mnt); + list_move(&mnt->mnt_fslink, &graveyard); + } + + /* + * go through the vfsmounts we've just consigned to the graveyard to + * - check that they're still dead + * - delete the vfsmount from the appropriate namespace under lock + * - dispose of the corpse + */ + while (!list_empty(&graveyard)) { + mnt = list_entry(graveyard.next, struct vfsmount, mnt_fslink); + list_del_init(&mnt->mnt_fslink); + + /* don't do anything if the namespace is dead - all the + * vfsmounts from it are going away anyway */ + namespace = mnt->mnt_namespace; + if (!namespace || atomic_read(&namespace->count) <= 0) + continue; + get_namespace(namespace); + + spin_unlock(&vfsmount_lock); + down_write(&namespace->sem); + spin_lock(&vfsmount_lock); + + /* check that it is still dead: the count should now be 2 - as + * contributed by the vfsmount parent and the mntget above */ + if (atomic_read(&mnt->mnt_count) == 2) { + struct vfsmount *xdmnt; + struct dentry *xdentry; + + /* delete from the namespace */ + list_del_init(&mnt->mnt_list); + list_del_init(&mnt->mnt_child); + list_del_init(&mnt->mnt_hash); + mnt->mnt_mountpoint->d_mounted--; + + xdentry = mnt->mnt_mountpoint; + mnt->mnt_mountpoint = mnt->mnt_root; + xdmnt = mnt->mnt_parent; + mnt->mnt_parent = mnt; + + spin_unlock(&vfsmount_lock); + + mntput(xdmnt); + dput(xdentry); + + /* now lay it to rest if this was the last ref on the + * superblock */ + if (atomic_read(&mnt->mnt_sb->s_active) == 1) { + /* last instance - try to be smart */ + lock_kernel(); + DQUOT_OFF(mnt->mnt_sb); + acct_auto_close(mnt->mnt_sb); + unlock_kernel(); + } + + mntput(mnt); + } + else { + /* someone brought it back to life whilst we didn't + * have any locks held so return it to the expiration + * list */ + list_add_tail(&mnt->mnt_fslink, mounts); + spin_unlock(&vfsmount_lock); + } + + up_write(&namespace->sem); + + mntput(mnt); + put_namespace(namespace); + + spin_lock(&vfsmount_lock); + } + + spin_unlock(&vfsmount_lock); +} + +EXPORT_SYMBOL_GPL(mark_mounts_for_expiry); + int copy_mount_options (const void __user *data, unsigned long *where) { int i; @@ -860,7 +1029,7 @@ long do_mount(char * dev_name, char * di else if (flags & MS_MOVE) retval = do_move_mount(&nd, dev_name); else - retval = do_add_mount(&nd, type_page, flags, mnt_flags, + retval = do_new_mount(&nd, type_page, flags, mnt_flags, dev_name, data_page); dput_out: path_release(&nd); @@ -1185,6 +1354,7 @@ static void __init init_mount_tree(void) init_rwsem(&namespace->sem); list_add(&mnt->mnt_list, &namespace->list); namespace->root = mnt; + mnt->mnt_namespace = namespace; init_task.namespace = namespace; read_lock(&tasklist_lock); @@ -1252,8 +1422,15 @@ void __init mnt_init(unsigned long mempa void __put_namespace(struct namespace *namespace) { + struct vfsmount *mnt; + down_write(&namespace->sem); spin_lock(&vfsmount_lock); + + list_for_each_entry(mnt, &namespace->list, mnt_list) { + mnt->mnt_namespace = NULL; + } + umount_tree(namespace->root); spin_unlock(&vfsmount_lock); up_write(&namespace->sem); diff -puN fs/super.c~intrinsic-automount-and-mountpoint-degradation-support fs/super.c --- 25/fs/super.c~intrinsic-automount-and-mountpoint-degradation-support Thu Jul 1 14:50:54 2004 +++ 25-akpm/fs/super.c Thu Jul 1 14:50:54 2004 @@ -794,6 +794,7 @@ do_kern_mount(const char *fstype, int fl mnt->mnt_root = dget(sb->s_root); mnt->mnt_mountpoint = sb->s_root; mnt->mnt_parent = mnt; + mnt->mnt_namespace = current->namespace; up_write(&sb->s_umount); put_filesystem(type); return mnt; @@ -810,6 +811,8 @@ out: return (struct vfsmount *)sb; } +EXPORT_SYMBOL_GPL(do_kern_mount); + struct vfsmount *kern_mount(struct file_system_type *type) { return do_kern_mount(type->name, 0, type->name, NULL); diff -puN include/linux/fs.h~intrinsic-automount-and-mountpoint-degradation-support include/linux/fs.h --- 25/include/linux/fs.h~intrinsic-automount-and-mountpoint-degradation-support Thu Jul 1 14:50:54 2004 +++ 25-akpm/include/linux/fs.h Thu Jul 1 15:23:15 2004 @@ -728,6 +728,7 @@ extern int send_sigurg(struct fown_struc #define MNT_FORCE 0x00000001 /* Attempt to forcibily umount */ #define MNT_DETACH 0x00000002 /* Just detach from the tree */ +#define MNT_EXPIRE 0x00000004 /* Mark for expiry */ extern struct list_head super_blocks; extern spinlock_t sb_lock; diff -puN include/linux/mount.h~intrinsic-automount-and-mountpoint-degradation-support include/linux/mount.h --- 25/include/linux/mount.h~intrinsic-automount-and-mountpoint-degradation-support Thu Jul 1 14:50:54 2004 +++ 25-akpm/include/linux/mount.h Thu Jul 1 14:50:54 2004 @@ -29,8 +29,11 @@ struct vfsmount struct list_head mnt_child; /* and going through their mnt_child */ atomic_t mnt_count; int mnt_flags; + int mnt_expiry_mark; /* true if marked for expiry */ char *mnt_devname; /* Name of device e.g. /dev/dsk/hda1 */ struct list_head mnt_list; + struct list_head mnt_fslink; /* link in fs-specific expiry list */ + struct namespace *mnt_namespace; /* containing namespace */ }; static inline struct vfsmount *mntget(struct vfsmount *mnt) @@ -42,7 +45,7 @@ static inline struct vfsmount *mntget(st extern void __mntput(struct vfsmount *mnt); -static inline void mntput(struct vfsmount *mnt) +static inline void _mntput(struct vfsmount *mnt) { if (mnt) { if (atomic_dec_and_test(&mnt->mnt_count)) @@ -50,10 +53,26 @@ static inline void mntput(struct vfsmoun } } +static inline void mntput(struct vfsmount *mnt) +{ + if (mnt) { + mnt->mnt_expiry_mark = 0; + _mntput(mnt); + } +} + extern void free_vfsmnt(struct vfsmount *mnt); extern struct vfsmount *alloc_vfsmnt(const char *name); extern struct vfsmount *do_kern_mount(const char *fstype, int flags, const char *name, void *data); + +struct nameidata; + +extern int do_add_mount(struct vfsmount *newmnt, struct nameidata *nd, + int mnt_flags, struct list_head *fslist); + +extern void mark_mounts_for_expiry(struct list_head *mounts); + extern spinlock_t vfsmount_lock; #endif diff -puN include/linux/namei.h~intrinsic-automount-and-mountpoint-degradation-support include/linux/namei.h --- 25/include/linux/namei.h~intrinsic-automount-and-mountpoint-degradation-support Thu Jul 1 14:50:54 2004 +++ 25-akpm/include/linux/namei.h Thu Jul 1 14:50:54 2004 @@ -61,6 +61,7 @@ extern int FASTCALL(path_lookup(const ch extern int FASTCALL(path_walk(const char *, struct nameidata *)); extern int FASTCALL(link_path_walk(const char *, struct nameidata *)); extern void path_release(struct nameidata *); +extern void path_release_on_umount(struct nameidata *); extern struct dentry * lookup_one_len(const char *, struct dentry *, int); extern struct dentry * lookup_hash(struct qstr *, struct dentry *); diff -puN include/linux/namespace.h~intrinsic-automount-and-mountpoint-degradation-support include/linux/namespace.h --- 25/include/linux/namespace.h~intrinsic-automount-and-mountpoint-degradation-support Thu Jul 1 14:50:54 2004 +++ 25-akpm/include/linux/namespace.h Thu Jul 1 14:50:54 2004 @@ -14,7 +14,7 @@ struct namespace { extern void umount_tree(struct vfsmount *); extern int copy_namespace(int, struct task_struct *); -void __put_namespace(struct namespace *namespace); +extern void __put_namespace(struct namespace *namespace); static inline void put_namespace(struct namespace *namespace) { _