Blame - Documentation/filesystems/vfs.rst - yocto/kernel/common

blob: a15527940b4612d18332eee6ea5d133172cf3ae6

Tobin C. Harding	099c5c7	2019-05-15 10:29:10 +1000	[diff] [blame]	1	.. SPDX-License-Identifier: GPL-2.0
				2
Tobin C. Harding	90ac11a	2019-05-15 10:29:09 +1000	[diff] [blame]	3	=========================================
				4	Overview of the Linux Virtual File System
				5	=========================================
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	6
Tobin C. Harding	e66b045	2019-05-15 10:29:11 +1000	[diff] [blame]	7	Original author: Richard Gooch <rgooch@atnf.csiro.au>
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	8
Tobin C. Harding	e66b045	2019-05-15 10:29:11 +1000	[diff] [blame]	9	- Copyright (C) 1999 Richard Gooch
				10	- Copyright (C) 2005 Pekka Enberg
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	11
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	12
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	13	Introduction
				14	============
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	15
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	16	The Virtual File System (also known as the Virtual Filesystem Switch) is
				17	the software layer in the kernel that provides the filesystem interface
				18	to userspace programs. It also provides an abstraction within the
				19	kernel which allows different filesystem implementations to coexist.
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	20
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	21	VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so on
				22	are called from a process context. Filesystem locking is described in
Mauro Carvalho Chehab	ec23eb5	2019-07-26 09:51:27 -0300	[diff] [blame]	23	the document Documentation/filesystems/locking.rst.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	24
				25
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	26	Directory Entry Cache (dcache)
				27	------------------------------
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	28
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	29	The VFS implements the open(2), stat(2), chmod(2), and similar system
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	30	calls. The pathname argument that is passed to them is used by the VFS
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	31	to search through the directory entry cache (also known as the dentry
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	32	cache or dcache). This provides a very fast look-up mechanism to
				33	translate a pathname (filename) into a specific dentry. Dentries live
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	34	in RAM and are never saved to disc: they exist only for performance.
				35
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	36	The dentry cache is meant to be a view into your entire filespace. As
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	37	most computers cannot fit all dentries in the RAM at the same time, some
				38	bits of the cache are missing. In order to resolve your pathname into a
				39	dentry, the VFS may have to resort to creating dentries along the way,
				40	and then loading the inode. This is done by looking up the inode.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	41
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	42
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	43	The Inode Object
				44	----------------
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	45
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	46	An individual dentry usually has a pointer to an inode. Inodes are
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	47	filesystem objects such as regular files, directories, FIFOs and other
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	48	beasts. They live either on the disc (for block device filesystems) or
				49	in the memory (for pseudo filesystems). Inodes that live on the disc
				50	are copied into the memory when required and changes to the inode are
				51	written back to disc. A single inode can be pointed to by multiple
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	52	dentries (hard links, for example, do this).
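
The relationship between names and inodes can be observed from userspace. The following is a minimal sketch, not kernel code: it creates a file and a hard link in a temporary directory and compares what stat(2) reports for the two names. The function name ``demo_hard_link_shares_inode`` and the ``vfs-demo-XXXXXX`` directory template are invented for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a file and a hard link to it inside a fresh
 * temporary directory under base_dir, then compare the inodes that stat(2)
 * reports for the two names.
 * Returns 1 if both names refer to the same inode with a link count of 2,
 * 0 if they do not, and -1 if any system call fails.
 */
int demo_hard_link_shares_inode(const char *base_dir)
{
	char dir[256], a[272], b[272];
	struct stat sa, sb;
	int fd, result = -1;

	assert(base_dir != NULL);
	if (snprintf(dir, sizeof(dir), "%s/vfs-demo-XXXXXX", base_dir) >=
	    (int)sizeof(dir))
		return -1;
	if (mkdtemp(dir) == NULL)
		return -1;
	snprintf(a, sizeof(a), "%s/a", dir);
	snprintf(b, sizeof(b), "%s/b", dir);

	fd = open(a, O_CREAT | O_WRONLY | O_EXCL, 0600);
	if (fd < 0)
		goto out_dir;
	close(fd);

	/* Second name (dentry) for the same inode. */
	if (link(a, b) != 0)
		goto out_a;
	if (stat(a, &sa) == 0 && stat(b, &sb) == 0)
		result = (sa.st_dev == sb.st_dev &&
			  sa.st_ino == sb.st_ino &&
			  sa.st_nlink == 2);

	unlink(b);
out_a:
	unlink(a);
out_dir:
	rmdir(dir);
	return result;
}
```

Calling ``demo_hard_link_shares_inode("/tmp")`` on a filesystem that supports hard links is expected to return 1: two directory entries, one inode.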
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	53
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	54	To look up an inode requires that the VFS calls the lookup() method of
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	55	the parent directory inode. This method is installed by the specific
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	56	filesystem implementation that the inode lives in. Once the VFS has the
				57	required dentry (and hence the inode), we can do all those boring things
				58	like open(2) the file, or stat(2) it to peek at the inode data. The
				59	stat(2) operation is fairly simple: once the VFS has the dentry, it
				60	peeks at the inode data and passes some of it back to userspace.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	61
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	62
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	63	The File Object
				64	---------------
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	65
				66	Opening a file requires another operation: allocation of a file
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	67	structure (this is the kernel-side implementation of file descriptors).
				68	The freshly allocated file structure is initialized with a pointer to
				69	the dentry and a set of file operation member functions. These are
				70	taken from the inode data. The open() file method is then called so the
				71	specific filesystem implementation can do its work. You can see that
				72	this is another switch performed by the VFS. The file structure is
				73	placed into the file descriptor table for the process.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	74
				75	Reading, writing and closing files (and other assorted VFS operations)
				76	is done by using the userspace file descriptor to grab the appropriate
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	77	file structure, and then calling the required file structure method to
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	78	do whatever is required. For as long as the file is open, it keeps the
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	79	dentry in use, which in turn means that the VFS inode is still in use.
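
The last point, that an open file keeps its inode in use, can be demonstrated from userspace. The sketch below is an illustration under assumed names (``demo_read_after_unlink`` and the ``/tmp/vfs-demo-XXXXXX`` template are invented here): it removes the only name of a file and then continues to use the file through its descriptor.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a temporary file, unlink its only name, then
 * write msg and read it back through the descriptor that is still open.
 * Returns the number of bytes read back, or -1 on failure.
 */
ssize_t demo_read_after_unlink(const char *msg, char *buf, size_t buflen)
{
	char path[] = "/tmp/vfs-demo-XXXXXX";
	ssize_t n = -1;
	size_t len;
	int fd;

	assert(msg != NULL && buf != NULL);
	len = strlen(msg);

	fd = mkstemp(path);
	if (fd < 0)
		return -1;

	/* The name is gone, but the open file still references the inode. */
	if (unlink(path) != 0)
		goto out;
	if (write(fd, msg, len) != (ssize_t)len)
		goto out;
	if (lseek(fd, 0, SEEK_SET) != 0)
		goto out;
	n = read(fd, buf, buflen);
out:
	/* Closing the descriptor drops the last reference to the file. */
	close(fd);
	return n;
}
```

The storage is only released once the descriptor is closed, which is why the read succeeds even though no pathname refers to the file any more.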
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	80
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	81
				82	Registering and Mounting a Filesystem
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	83	=====================================
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	84
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	85	To register and unregister a filesystem, use the following API
				86	functions:
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	87
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	88	.. code-block:: c
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	89
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	90	#include <linux/fs.h>
				91
				92	extern int register_filesystem(struct file_system_type *);
				93	extern int unregister_filesystem(struct file_system_type *);
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	94
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	95	The passed struct file_system_type describes your filesystem. When a
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	96	request is made to mount a filesystem onto a directory in your
				97	namespace, the VFS will call the appropriate mount() method for the
				98	specific filesystem. A new vfsmount referring to the tree returned by
				99	->mount() will be attached to the mountpoint, so that when pathname
				100	resolution reaches the mountpoint it will jump into the root of that
				101	vfsmount.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	102
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	103	You can see all filesystems that are registered to the kernel in the
				104	file /proc/filesystems.
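
A userspace program can consult that file to check whether a filesystem type is available. The sketch below assumes the usual layout of /proc/filesystems, where each line holds an optional ``nodev`` marker, a tab, and the type name; the function name ``demo_fs_is_registered`` is invented for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan /proc/filesystems for a filesystem type name.
 * Returns 1 if the type is listed, 0 if it is not, and -1 if the file
 * cannot be opened (for example when /proc is not mounted).
 */
int demo_fs_is_registered(const char *fstype)
{
	char line[256];
	int found = 0;
	FILE *fp;

	assert(fstype != NULL);
	fp = fopen("/proc/filesystems", "r");
	if (fp == NULL)
		return -1;

	while (!found && fgets(line, sizeof(line), fp) != NULL) {
		/* The type name follows the last tab on the line. */
		char *name = strrchr(line, '\t');

		name = name ? name + 1 : line;
		name[strcspn(name, "\n")] = '\0';
		found = (strcmp(name, fstype) == 0);
	}
	fclose(fp);
	return found;
}
```

For instance, ``demo_fs_is_registered("proc")`` should report 1 on any system where /proc itself is mounted.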
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	105
				106
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	107	struct file_system_type
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	108	-----------------------
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	109
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	110	This describes the filesystem. As of kernel 2.6.39, the following
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	111	members are defined:
				112
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	113	.. code-block:: c
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	114
Liao Pingfang	6a2195a	2021-01-10 15:59:59 +0800	[diff] [blame]	115	struct file_system_type {
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	116	const char *name;
				117	int fs_flags;
				118	struct dentry *(*mount) (struct file_system_type *, int,
				119	const char *, void *);
				120	void (*kill_sb) (struct super_block *);
				121	struct module *owner;
				122	struct file_system_type * next;
				123	struct list_head fs_supers;
				124	struct lock_class_key s_lock_key;
				125	struct lock_class_key s_umount_key;
				126	};
				127
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	128	``name``
				129	the name of the filesystem type, such as "ext2", "iso9660",
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	130	"msdos" and so on
				131
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	132	``fs_flags``
				133	various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE)
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	134
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	135	``mount``
				136	the method to call when a new instance of this filesystem should
				137	be mounted
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	138
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	139	``kill_sb``
				140	the method to call when an instance of this filesystem should be
				141	shut down
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	142
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	143
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	144	``owner``
				145	for internal VFS use: you should initialize this to THIS_MODULE
				146	in most cases.
				147
				148	``next``
				149	for internal VFS use: you should initialize this to NULL
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	150
Borislav Petkov	0746aec	2007-07-15 23:41:19 -0700	[diff] [blame]	151	s_lock_key, s_umount_key: lockdep-specific
				152
Al Viro	1a102ff	2011-03-16 09:07:58 -0400	[diff] [blame]	153	The mount() method has the following arguments:
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	154
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	155	``struct file_system_type *fs_type``
				156	describes the filesystem, partly initialized by the specific
				157	filesystem code
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	158
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	159	``int flags``
				160	mount flags
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	161
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	162	``const char *dev_name``
				163	the device name we are mounting.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	164
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	165	``void *data``
				166	arbitrary mount options, usually passed as an ASCII string (see
				167	"Mount Options" section)
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	168
Al Viro	1a102ff	2011-03-16 09:07:58 -0400	[diff] [blame]	169	The mount() method must return the root dentry of the tree requested by
				170	caller. An active reference to its superblock must be grabbed and the
				171	superblock must be locked. On failure it should return ERR_PTR(error).
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	172
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	173	The arguments match those of mount(2) and their interpretation depends
				174	on filesystem type. E.g. for block filesystems, dev_name is interpreted
				175	as block device name, that device is opened and if it contains a
				176	suitable filesystem image the method creates and initializes struct
				177	super_block accordingly, returning its root dentry to caller.
Al Viro	1a102ff	2011-03-16 09:07:58 -0400	[diff] [blame]	178
				179	->mount() may choose to return a subtree of existing filesystem - it
				180	doesn't have to create a new one. The main result from the caller's
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	181	point of view is a reference to dentry at the root of (sub)tree to be
				182	attached; creation of new superblock is a common side effect.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	183
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	184	The most interesting member of the superblock structure that the mount()
				185	method fills in is the "s_op" field. This is a pointer to a "struct
				186	super_operations" which describes the next level of the filesystem
				187	implementation.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	188
Al Viro	1a102ff	2011-03-16 09:07:58 -0400	[diff] [blame]	189	Usually, a filesystem uses one of the generic mount() implementations
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	190	and provides a fill_super() callback instead. The generic variants are:
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	191
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	192	``mount_bdev``
				193	mount a filesystem residing on a block device
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	194
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	195	``mount_nodev``
				196	mount a filesystem that is not backed by a device
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	197
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	198	``mount_single``
				199	mount a filesystem which shares the instance between all mounts
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	200
Al Viro	1a102ff	2011-03-16 09:07:58 -0400	[diff] [blame]	201	A fill_super() callback implementation has the following arguments:
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	202
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	203	``struct super_block *sb``
				204	the superblock structure. The callback must initialize this
				205	properly.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	206
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	207	``void *data``
				208	arbitrary mount options, usually passed as an ASCII string (see
				209	"Mount Options" section)
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	210
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	211	``int silent``
				212	whether or not to be silent on error
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	213
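
The registration and mount dispatch described in this section can be modelled in plain userspace C. The sketch below is not kernel code and does not use the real API: ``struct demo_fs_type``, the ``demo_*`` functions and the ``ramdemo`` type are invented stand-ins. It only illustrates the idea of a name-keyed list of filesystem types, where a duplicate name is refused and a mount request is routed to the ``mount`` method of the matching type.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for struct file_system_type. */
struct demo_fs_type {
	const char *name;
	/* Returns an opaque "root" token on success, NULL on failure. */
	const char *(*mount)(struct demo_fs_type *, const char *dev_name);
	struct demo_fs_type *next;	/* list linkage, owned by the registry */
};

static struct demo_fs_type *demo_file_systems;	/* head of the registry */

/* Return the link pointing at the named type, or at the list's NULL tail. */
static struct demo_fs_type **demo_find(const char *name)
{
	struct demo_fs_type **p;

	for (p = &demo_file_systems; *p; p = &(*p)->next)
		if (strcmp((*p)->name, name) == 0)
			break;
	return p;
}

/* Append a type to the registry; refuse names that are already present. */
int demo_register_filesystem(struct demo_fs_type *fs)
{
	struct demo_fs_type **p;

	assert(fs != NULL && fs->name != NULL);
	if (fs->next)
		return -EBUSY;
	p = demo_find(fs->name);
	if (*p)
		return -EBUSY;
	*p = fs;
	return 0;
}

/* Remove a type from the registry; fail if it was never registered. */
int demo_unregister_filesystem(struct demo_fs_type *fs)
{
	struct demo_fs_type **p;

	assert(fs != NULL);
	for (p = &demo_file_systems; *p; p = &(*p)->next) {
		if (*p == fs) {
			*p = fs->next;
			fs->next = NULL;
			return 0;
		}
	}
	return -EINVAL;
}

/* The "switch": look the type up by name and call its mount method. */
const char *demo_mount(const char *type_name, const char *dev_name)
{
	struct demo_fs_type *fs = *demo_find(type_name);

	return fs ? fs->mount(fs, dev_name) : NULL;
}

/* An example filesystem type that plugs into the registry. */
const char *ramdemo_mount(struct demo_fs_type *fs, const char *dev_name)
{
	(void)fs;
	(void)dev_name;
	return "ramdemo-root";
}

struct demo_fs_type ramdemo_type = {
	.name = "ramdemo",
	.mount = ramdemo_mount,
};
```

In this model ``next`` starts out as NULL and is only touched by the registry, which matches the advice above to initialize the real ``next`` member to NULL.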
				214
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	215	The Superblock Object
				216	=====================
				217
				218	A superblock object represents a mounted filesystem.
				219
				220
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	221	struct super_operations
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	222	-----------------------
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	223
				224	This describes how the VFS can manipulate the superblock of your
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	225	filesystem. As of kernel 2.6.22, the following members are defined:
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	226
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	227	.. code-block:: c
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	228
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	229	struct super_operations {
				230	struct inode *(*alloc_inode)(struct super_block *sb);
				231	void (*destroy_inode)(struct inode *);
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	232
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	233	void (*dirty_inode) (struct inode *, int flags);
				234	int (*write_inode) (struct inode *, int);
				235	void (*drop_inode) (struct inode *);
				236	void (*delete_inode) (struct inode *);
				237	void (*put_super) (struct super_block *);
				238	int (*sync_fs)(struct super_block *sb, int wait);
				239	int (*freeze_fs) (struct super_block *);
				240	int (*unfreeze_fs) (struct super_block *);
				241	int (*statfs) (struct dentry *, struct kstatfs *);
				242	int (*remount_fs) (struct super_block *, int *, char *);
				243	void (*clear_inode) (struct inode *);
				244	void (*umount_begin) (struct super_block *);
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	245
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	246	int (*show_options)(struct seq_file *, struct dentry *);
				247
				248	ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
				249	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
				250	int (*nr_cached_objects)(struct super_block *);
				251	void (*free_cached_objects)(struct super_block *, int);
				252	};
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	253
				254	All methods are called without any locks being held, unless otherwise
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	255	noted. This means that most methods can block safely. All methods are
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	256	only called from a process context (i.e. not from an interrupt handler
				257	or bottom half).
				258
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	259	``alloc_inode``
				260	this method is called by alloc_inode() to allocate memory for
				261	struct inode and initialize it. If this function is not
Tobin C. Harding	50c1f43	2019-05-15 10:29:05 +1000	[diff] [blame]	262	defined, a simple 'struct inode' is allocated. Normally
				263	alloc_inode will be used to allocate a larger structure which
				264	contains a 'struct inode' embedded within it.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	265
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	266	``destroy_inode``
				267	this method is called by destroy_inode() to release resources
				268	allocated for struct inode. It is only required if
Tobin C. Harding	50c1f43	2019-05-15 10:29:05 +1000	[diff] [blame]	269	->alloc_inode was defined and simply undoes anything done by
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	270	->alloc_inode.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	271
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	272	``dirty_inode``
Eric Biggers	a38ed48	2021-01-12 11:02:48 -0800	[diff] [blame]	273	this method is called by the VFS when an inode is marked dirty.
				274	This is specifically for the inode itself being marked dirty,
				275	not its data. If the update needs to be persisted by fdatasync(),
				276	then I_DIRTY_DATASYNC will be set in the flags argument.
Lukas Czerner	0d94230	2022-08-25 12:06:57 +0200	[diff] [blame]	277	I_DIRTY_TIME will be set in the flags in case lazytime is enabled
				278	and struct inode has times updated since the last ->dirty_inode
				279	call.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	280
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	281	``write_inode``
				282	this method is called when the VFS needs to write an inode to
				283	disc. The second parameter indicates whether the write should
				284	be synchronous or not; not all filesystems check this flag.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	285
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	286	``drop_inode``
				287	called when the last access to the inode is dropped, with the
				288	inode->i_lock spinlock held.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	289
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	290	This method should be either NULL (normal UNIX filesystem
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	291	semantics) or "generic_delete_inode" (for filesystems that do
				292	not want to cache inodes - causing "delete_inode" to always be
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	293	called regardless of the value of i_nlink)
				294
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	295	The "generic_delete_inode()" behavior is equivalent to the old
				296	practice of using "force_delete" in the put_inode() case, but
				297	does not have the races that the "force_delete()" approach had.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	298
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	299	``delete_inode``
				300	called when the VFS wants to delete an inode
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	301
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	302	``put_super``
				303	called when the VFS wishes to free the superblock
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	304	(i.e. unmount). This is called with the superblock lock held
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	305
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	306	``sync_fs``
				307	called when VFS is writing out all dirty data associated with a
				308	superblock. The second parameter indicates whether the method
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	309	should wait until the write out has been completed. Optional.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	310
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	311	``freeze_fs``
				312	called when VFS is locking a filesystem and forcing it into a
				313	consistent state. This method is currently used by the Logical
				314	Volume Manager (LVM).
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	315
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	316	``unfreeze_fs``
				317	called when VFS is unlocking a filesystem and making it writable
Tobin C. Harding	50c1f43	2019-05-15 10:29:05 +1000	[diff] [blame]	318	again.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	319
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	320	``statfs``
				321	called when the VFS needs to get filesystem statistics.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	322
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	323	``remount_fs``
				324	called when the filesystem is remounted. This is called with
				325	the kernel lock held
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	326
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	327	``clear_inode``
				328	called when the VFS clears the inode. Optional
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	329
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	330	``umount_begin``
				331	called when the VFS is unmounting a filesystem.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	332
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	333	``show_options``
				334	called by the VFS to show mount options for /proc/<pid>/mounts.
				335	(see "Mount Options" section)
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	336
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	337	``quota_read``
				338	called by the VFS to read from filesystem quota file.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	339
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	340	``quota_write``
				341	called by the VFS to write to filesystem quota file.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	342
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	343	``nr_cached_objects``
				344	called by the sb cache shrinking function for the filesystem to
				345	return the number of freeable cached objects it contains.
Dave Chinner	0e1fdaf	2011-07-08 14:14:44 +1000	[diff] [blame]	346	Optional.
				347
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	348	``free_cached_objects``
				349	called by the sb cache shrinking function for the filesystem to
				350	scan the number of objects indicated to try to free them.
				351	Optional, but any filesystem implementing this method needs to
				352	also implement ->nr_cached_objects for it to be called
				353	correctly.
Dave Chinner	0e1fdaf	2011-07-08 14:14:44 +1000	[diff] [blame]	354
				355	We can't do anything with any errors that the filesystem might
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	356	have encountered, hence the void return type. This will never be
				357	called if the VM is trying to reclaim under GFP_NOFS conditions,
				358	hence this method does not need to handle that situation itself.
Dave Chinner	0e1fdaf	2011-07-08 14:14:44 +1000	[diff] [blame]	359
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	360	Implementations must include conditional reschedule calls inside
				361	any scanning loop that is done. This allows the VFS to
				362	determine appropriate scan batch sizes without having to worry
				363	about whether implementations will cause holdoff problems due to
				364	large scan batch sizes.
Dave Chinner	8ab4766	2011-07-08 14:14:45 +1000	[diff] [blame]	365
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	366	Whoever sets up the inode is responsible for filling in the "i_op"
				367	field. This is a pointer to a "struct inode_operations" which describes
				368	the methods that can be performed on individual inodes.
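
The ``alloc_inode`` and ``destroy_inode`` entries above mention allocating a larger structure with a 'struct inode' embedded within it. The following userspace sketch models that pattern only; ``struct demo_inode``, ``struct demofs_inode_info``, ``DEMOFS_I`` and the ``demo_container_of`` macro are invented for illustration and are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the generic struct inode. */
struct demo_inode {
	unsigned long i_ino;
};

/* Filesystem-private inode: generic part embedded in a larger structure. */
struct demofs_inode_info {
	int private_flags;
	struct demo_inode vfs_inode;
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Models ->alloc_inode: allocate the large object, hand out the small one. */
struct demo_inode *demofs_alloc_inode(void)
{
	struct demofs_inode_info *info = calloc(1, sizeof(*info));

	return info ? &info->vfs_inode : NULL;
}

/* Map a generic inode back to the filesystem-private structure. */
struct demofs_inode_info *DEMOFS_I(struct demo_inode *inode)
{
	assert(inode != NULL);
	return demo_container_of(inode, struct demofs_inode_info, vfs_inode);
}

/* Models ->destroy_inode: undo exactly what demofs_alloc_inode() did. */
void demofs_destroy_inode(struct demo_inode *inode)
{
	free(DEMOFS_I(inode));
}
```

Generic code only ever sees the embedded ``struct demo_inode``; the filesystem converts back to its private structure when it needs its own fields, which is why a matching destroy method is required whenever a custom allocation method is supplied.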
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	369
Tobin C. Harding	e04c83c	2019-05-15 10:29:08 +1000	[diff] [blame]	370
Andreas Gruenbacher	6c6ef9f	2016-09-29 17:48:44 +0200	[diff] [blame]	371	struct xattr_handlers
				372	---------------------
				373
				374	On filesystems that support extended attributes (xattrs), the s_xattr
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	375	superblock field points to a NULL-terminated array of xattr handlers.
				376	Extended attributes are name:value pairs.
Andreas Gruenbacher	6c6ef9f	2016-09-29 17:48:44 +0200	[diff] [blame]	377
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	378	``name``
				379	Indicates that the handler matches attributes with the specified
				380	name (such as "system.posix_acl_access"); the prefix field must
				381	be NULL.
Andreas Gruenbacher	6c6ef9f	2016-09-29 17:48:44 +0200	[diff] [blame]	382
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	383	``prefix``
				384	Indicates that the handler matches all attributes with the
				385	specified name prefix (such as "user."); the name field must be
				386	NULL.
Andreas Gruenbacher	6c6ef9f	2016-09-29 17:48:44 +0200	[diff] [blame]	387
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	388	``list``
				389	Determine if attributes matching this xattr handler should be
				390	listed for a particular dentry. Used by some listxattr
				391	implementations like generic_listxattr.
Andreas Gruenbacher	6c6ef9f	2016-09-29 17:48:44 +0200	[diff] [blame]	392
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	393	``get``
				394	Called by the VFS to get the value of a particular extended
				395	attribute. This method is called by the getxattr(2) system
				396	call.
Andreas Gruenbacher	6c6ef9f	2016-09-29 17:48:44 +0200	[diff] [blame]	397
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	398	``set``
				399	Called by the VFS to set the value of a particular extended
				400	attribute. When the new value is NULL, called to remove a
Randy Dunlap	8286de7	2020-07-03 14:43:25 -0700	[diff] [blame]	401	particular extended attribute. This method is called by the
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	402	setxattr(2) and removexattr(2) system calls.
Andreas Gruenbacher	6c6ef9f	2016-09-29 17:48:44 +0200	[diff] [blame]	403
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	404	When none of the xattr handlers of a filesystem match the specified
				405	attribute name or when a filesystem doesn't support extended attributes,
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	406	the various ``*xattr(2)`` system calls return -EOPNOTSUPP.
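
From userspace, that error surfaces as a failed call with errno set to ENOTSUP (the same value as EOPNOTSUPP on Linux). The sketch below, with the invented helper name ``demo_xattr_roundtrip`` and the invented attribute name ``user.vfs_demo``, stores an attribute on a temporary file with setxattr(2), reads it back with getxattr(2), and reports a negative errno value if the filesystem rejects the request.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <unistd.h>

/*
 * Hypothetical helper: set a "user." attribute on a temporary file and read
 * it back. Returns 0 if the value round-trips, or a negative errno value,
 * for example -ENOTSUP when no xattr handler accepts the name.
 */
int demo_xattr_roundtrip(const char *value)
{
	char path[] = "/tmp/vfs-demo-XXXXXX";
	char buf[64];
	size_t len;
	ssize_t n;
	int fd, ret;

	assert(value != NULL);
	len = strlen(value);
	if (len > sizeof(buf))
		return -E2BIG;

	fd = mkstemp(path);
	if (fd < 0)
		return -errno;
	close(fd);

	if (setxattr(path, "user.vfs_demo", value, len, 0) != 0) {
		ret = -errno;
		goto out;
	}
	n = getxattr(path, "user.vfs_demo", buf, sizeof(buf));
	if (n < 0)
		ret = -errno;
	else
		ret = ((size_t)n == len && memcmp(buf, value, len) == 0) ? 0 : -EIO;
out:
	/* Removing the file also discards its extended attributes. */
	unlink(path);
	return ret;
}
```

Whether the call succeeds depends on the filesystem backing /tmp, so callers should treat -ENOTSUP as an expected outcome rather than a bug.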
Andreas Gruenbacher	6c6ef9f	2016-09-29 17:48:44 +0200	[diff] [blame]	407
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	408
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	409	The Inode Object
				410	================
				411
				412	An inode object represents an object within the filesystem.
				413
				414
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	415	struct inode_operations
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	416	-----------------------
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	417
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	418	This describes how the VFS can manipulate an inode in your filesystem.
				419	As of kernel 2.6.22, the following members are defined:
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	420
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	421	.. code-block:: c
				422
				423	struct inode_operations {
Christian Brauner	549c729	2021-01-21 14:19:43 +0100	[diff] [blame]	424	int (*create) (struct user_namespace *, struct inode *,struct dentry *, umode_t, bool);
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	425	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
				426	int (*link) (struct dentry *,struct inode *,struct dentry *);
				427	int (*unlink) (struct inode *,struct dentry *);
Christian Brauner	549c729	2021-01-21 14:19:43 +0100	[diff] [blame]	428	int (*symlink) (struct user_namespace *, struct inode *,struct dentry *,const char *);
				429	int (*mkdir) (struct user_namespace *, struct inode *,struct dentry *,umode_t);
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	430	int (*rmdir) (struct inode *,struct dentry *);
Christian Brauner	549c729	2021-01-21 14:19:43 +0100	[diff] [blame]	431	int (*mknod) (struct user_namespace *, struct inode *,struct dentry *,umode_t,dev_t);
				432	int (*rename) (struct user_namespace *, struct inode *, struct dentry *,
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	433	struct inode *, struct dentry *, unsigned int);
				434	int (*readlink) (struct dentry *, char __user *,int);
				435	const char *(*get_link) (struct dentry *, struct inode *,
				436	struct delayed_call *);
Christian Brauner	549c729	2021-01-21 14:19:43 +0100	[diff] [blame]	437	int (*permission) (struct user_namespace *, struct inode *, int);
Miklos Szeredi	0cad624	2021-08-18 22:08:24 +0200	[diff] [blame]	438	struct posix_acl * (*get_acl)(struct inode *, int, bool);
Christian Brauner	549c729	2021-01-21 14:19:43 +0100	[diff] [blame]	439	int (*setattr) (struct user_namespace *, struct dentry *, struct iattr *);
				440	int (*getattr) (struct user_namespace *, const struct path *, struct kstat *, u32, unsigned int);
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	441	ssize_t (*listxattr) (struct dentry *, char *, size_t);
				442	void (*update_time)(struct inode *, struct timespec *, int);
				443	int (*atomic_open)(struct inode *, struct dentry *, struct file *,
				444	unsigned open_flag, umode_t create_mode);
Christian Brauner	549c729	2021-01-21 14:19:43 +0100	[diff] [blame]	445	int (*tmpfile) (struct user_namespace *, struct inode *, struct dentry *, umode_t);
				446	int (*set_acl)(struct user_namespace *, struct inode *, struct posix_acl *, int);
Miklos Szeredi	4c5b479	2021-04-07 14:36:42 +0200	[diff] [blame]	447	int (*fileattr_set)(struct user_namespace *mnt_userns,
				448	struct dentry *dentry, struct fileattr *fa);
				449	int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	450	};
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	451
				452	Again, all methods are called without any locks being held, unless
				453	otherwise noted.
				454
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	455	``create``
				456	called by the open(2) and creat(2) system calls. Only required
				457	if you want to support regular files. The dentry you get should
				458	not have an inode (i.e. it should be a negative dentry). Here
				459	you will probably call d_instantiate() with the dentry and the
				460	newly created inode
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	461
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	462	``lookup``
				463	called when the VFS needs to look up an inode in a parent
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	464	directory. The name to look for is found in the dentry. This
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	465	method must call d_add() to insert the found inode into the
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	466	dentry. The "i_count" field in the inode structure should be
				467	incremented. If the named inode does not exist a NULL inode
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	468	should be inserted into the dentry (this is called a negative
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	469	dentry). Returning an error code from this routine must only be
				470	done on a real error, otherwise creating inodes with system
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	471	calls like creat(2), mknod(2), mkdir(2) and so on will fail.
				472	If you wish to overload the dentry methods then you should
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	473	initialise the "d_dop" field in the dentry; this is a pointer to
				474	a struct "dentry_operations". This method is called with the
				475	directory inode semaphore held
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	476
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	477	``link``
				478	called by the link(2) system call. Only required if you want to
				479	support hard links. You will probably need to call
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	480	d_instantiate() just as you would in the create() method
				481
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	482	``unlink``
				483	called by the unlink(2) system call. Only required if you want
				484	to support deleting inodes
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	485
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	486	``symlink``
				487	called by the symlink(2) system call. Only required if you want
				488	to support symlinks. You will probably need to call
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	489	d_instantiate() just as you would in the create() method
				490
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	491	``mkdir``
				492	called by the mkdir(2) system call. Only required if you want
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	493	to support creating subdirectories. You will probably need to
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	494	call d_instantiate() just as you would in the create() method
				495
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	496	``rmdir``
				497	called by the rmdir(2) system call. Only required if you want
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	498	to support deleting subdirectories
				499
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	500	``mknod``
				501	called by the mknod(2) system call to create a device (char,
				502	block) inode or a named pipe (FIFO) or socket. Only required if
				503	you want to support creating these types of inodes. You will
				504	probably need to call d_instantiate() just as you would in the
				505	create() method
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	506
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	507	``rename``
				508	called by the rename(2) system call to rename the object to have
				509	the parent and name given by the second inode and dentry.
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	510
Miklos Szeredi	18fc84d	2016-09-27 11:03:58 +0200	[diff] [blame]	511	The filesystem must return -EINVAL for any unsupported or
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	512	unknown flags. Currently the following flags are implemented:
				513	(1) RENAME_NOREPLACE: this flag indicates that if the target of
				514	the rename exists the rename should fail with -EEXIST instead of
				515	replacing the target. The VFS already checks for existence, so
				516	for local filesystems the RENAME_NOREPLACE implementation is
				517	equivalent to plain rename.
Miklos Szeredi	520c8b1	2014-04-01 17:08:42 +0200	[diff] [blame]	518	(2) RENAME_EXCHANGE: exchange source and target. Both must
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	519	exist; this is checked by the VFS. Unlike plain rename, source
				520	and target may be of different type.
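
The flag rule above can be sketched in plain userspace C. This is only an illustration of the "reject unknown flags with -EINVAL" contract, not code from any filesystem: the ``EX_`` constants are local copies of the flag bits and ``ex_rename_check_flags`` is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>

/* Local copies of the flag bits so the sketch builds without kernel headers. */
#define EX_RENAME_NOREPLACE (1u << 0)
#define EX_RENAME_EXCHANGE  (1u << 1)

/* Hypothetical helper: the flag check a ->rename method might start with. */
static int ex_rename_check_flags(unsigned int flags)
{
	const unsigned int supported = EX_RENAME_NOREPLACE | EX_RENAME_EXCHANGE;

	if (flags & ~supported)
		return -EINVAL;	/* unknown or unsupported flag */
	return 0;
}

/* Usage: known flags pass, anything else is rejected. */
static void ex_rename_demo(void)
{
	assert(ex_rename_check_flags(0) == 0);
	assert(ex_rename_check_flags(EX_RENAME_NOREPLACE) == 0);
	assert(ex_rename_check_flags(1u << 5) == -EINVAL);
}
```

A filesystem that supports only plain rename would shrink ``supported`` to 0, so every flag is rejected.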
Miklos Szeredi	520c8b1	2014-04-01 17:08:42 +0200	[diff] [blame]	521
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	522	``get_link``
				523	called by the VFS to follow a symbolic link to the inode it
				524	points to. Only required if you want to support symbolic links.
				525	This method returns the symlink body to traverse (and possibly
				526	resets the current position with nd_jump_link()). If the body
				527	won't go away until the inode is gone, nothing else is needed;
				528	if it needs to be otherwise pinned, arrange for its release by
				529	having get_link(..., ..., done) do set_delayed_call(done,
				530	destructor, argument). In that case destructor(argument) will
				531	be called once VFS is done with the body you've returned. May
				532	be called in RCU mode; that is indicated by NULL dentry
Al Viro	fceef39	2015-12-29 15:58:39 -0500	[diff] [blame]	533	argument. If request can't be handled without leaving RCU mode,
				534	have it return ERR_PTR(-ECHILD).
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	535
Eric Biggers	dcb2cb1	2019-04-11 16:16:28 -0700	[diff] [blame]	536	If the filesystem stores the symlink target in ->i_link, the
				537	VFS may use it directly without calling ->get_link(); however,
				538	->get_link() must still be provided. ->i_link must not be
				539	freed until after an RCU grace period. Writing to ->i_link
				540	post-iget() time requires a 'release' memory barrier.
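
The delayed-call arrangement described for get_link can be modelled in userspace. The sketch below assumes a symlink body that must be pinned (here, a heap copy) and released once the caller is finished with it. All ``ex_`` names are hypothetical stand-ins; the struct only mirrors the idea of a destructor plus argument and is not the kernel definition.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a delayed call: a destructor and its argument. */
struct ex_delayed_call {
	void (*fn)(void *);
	void *arg;
};

static void ex_set_delayed_call(struct ex_delayed_call *call,
				void (*fn)(void *), void *arg)
{
	call->fn = fn;
	call->arg = arg;
}

/* Run the destructor, if one was registered. */
static void ex_do_delayed_call(struct ex_delayed_call *call)
{
	if (call->fn)
		call->fn(call->arg);
}

static int ex_frees;	/* counts destructor runs, for the demo only */

static void ex_free_body(void *p)
{
	free(p);
	ex_frees++;
}

/* Model of get_link(..., ..., done): return a body and arrange its release. */
static const char *ex_get_link(const char *stored, struct ex_delayed_call *done)
{
	size_t n = strlen(stored) + 1;
	char *body = malloc(n);

	if (!body)
		return NULL;
	memcpy(body, stored, n);
	ex_set_delayed_call(done, ex_free_body, body);
	return body;
}

/* Model of the caller: use the body, then fire the delayed call. */
static int ex_follow_once(void)
{
	struct ex_delayed_call done = { NULL, NULL };
	const char *body = ex_get_link("target/path", &done);
	int ok = body && strcmp(body, "target/path") == 0;

	ex_do_delayed_call(&done);
	return ok;
}
```

If the body lives as long as the inode, the model's ``ex_get_link`` would simply return the stored pointer and leave ``done`` untouched.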
				541
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	542	``readlink``
				543	this is now just an override for use by readlink(2) for the
Miklos Szeredi	76fca90	2016-12-09 16:45:04 +0100	[diff] [blame]	544	cases when ->get_link uses nd_jump_link() or object is not in
				545	fact a symlink. Normally filesystems should only implement
				546	->get_link for symlinks and readlink(2) will automatically use
				547	that.
				548
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	549	``permission``
				550	called by the VFS to check for access rights on a POSIX-like
Tobin C. Harding	50c1f43	2019-05-15 10:29:05 +1000	[diff] [blame]	551	filesystem.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	552
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	553	May be called in rcu-walk mode (mask & MAY_NOT_BLOCK). If in
				554	rcu-walk mode, the filesystem must check the permission without
				555	blocking or storing to the inode.
Nick Piggin	b74c79e	2011-01-07 17:49:58 +1100	[diff] [blame]	556
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	557	If a situation is encountered that rcu-walk cannot handle,
				558	return
Nick Piggin	b74c79e	2011-01-07 17:49:58 +1100	[diff] [blame]	559	-ECHILD and it will be called again in ref-walk mode.
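
The rcu-walk fallback can be illustrated with a small userspace model. It assumes a filesystem whose permission data may or may not already be in memory; ``EX_MAY_NOT_BLOCK``, ``struct ex_inode`` and ``ex_permission`` are hypothetical names introduced for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define EX_MAY_NOT_BLOCK 0x80	/* illustrative bit standing in for MAY_NOT_BLOCK */

struct ex_inode {
	bool perm_cached;	/* hypothetical: permission data already in memory */
	bool allow;		/* result of the permission check itself */
};

/* Model of a ->permission method honouring the non-blocking request. */
static int ex_permission(const struct ex_inode *inode, int mask)
{
	if (!inode->perm_cached) {
		if (mask & EX_MAY_NOT_BLOCK)
			return -ECHILD;	/* cannot block here; ask for a ref-walk retry */
		/* ref-walk mode: a real filesystem could block here to fetch data */
	}
	return inode->allow ? 0 : -EACCES;
}

/* Usage: the same inode is refused in rcu-walk and answered in ref-walk. */
static void ex_permission_demo(void)
{
	struct ex_inode cold = { false, true };

	assert(ex_permission(&cold, EX_MAY_NOT_BLOCK) == -ECHILD);
	assert(ex_permission(&cold, 0) == 0);
}
```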
				560
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	561	``setattr``
				562	called by the VFS to set attributes for a file. This method is
				563	called by chmod(2) and related system calls.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	564
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	565	``getattr``
				566	called by the VFS to get attributes of a file. This method is
				567	called by stat(2) and related system calls.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	568
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	569	``listxattr``
				570	called by the VFS to list all extended attributes for a given
				571	file. This method is called by the listxattr(2) system call.
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	572
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	573	``update_time``
				574	called by the VFS to update a specific time or the i_version of
				575	an inode. If this is not defined the VFS will update the inode
				576	itself and call mark_inode_dirty_sync.
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	577
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	578	``atomic_open``
				579	called on the last component of an open. Using this optional
				580	method the filesystem can look up, possibly create and open the
				581	file in one atomic operation. If it wants to leave actual
				582	opening to the caller (e.g. if the file turned out to be a
				583	symlink, device, or just something filesystem won't do atomic
				584	open for), it may signal this by returning finish_no_open(file,
				585	dentry). This method is only called if the last component is
				586	negative or needs lookup. Cached positive dentries are still
				587	handled by f_op->open(). If the file was created, FMODE_CREATED
				588	flag should be set in file->f_mode. In case of O_EXCL the
				589	method must only succeed if the file didn't exist and hence
				590	FMODE_CREATED shall always be set on success.
Miklos Szeredi	d18e900	2012-06-05 15:10:17 +0200	[diff] [blame]	591
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	592	``tmpfile``
				593	called at the end of O_TMPFILE open(). Optional, equivalent to
				594	atomically creating, opening and unlinking a file in given
				595	directory.
Al Viro	48bde8d	2013-07-03 16:19:23 +0400	[diff] [blame]	596
Miklos Szeredi	4c5b479	2021-04-07 14:36:42 +0200	[diff] [blame]	597	``fileattr_get``
				598	called on ioctl(FS_IOC_GETFLAGS) and ioctl(FS_IOC_FSGETXATTR) to
				599	retrieve miscellaneous file flags and attributes. Also called
				600	before the relevant SET operation to check what is being changed
				601	(in this case with i_rwsem locked exclusive). If unset, then
				602	fall back to f_op->ioctl().
				603
				604	``fileattr_set``
				605	called on ioctl(FS_IOC_SETFLAGS) and ioctl(FS_IOC_FSSETXATTR) to
				606	change miscellaneous file flags and attributes. Callers hold
				607	i_rwsem exclusive. If unset, then fall back to f_op->ioctl().
				608
Tobin C. Harding	e04c83c	2019-05-15 10:29:08 +1000	[diff] [blame]	609
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	610	The Address Space Object
				611	========================
				612
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	613	The address space object is used to group and manage pages in the page
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	614	cache. It can be used to keep track of the pages in a file (or anything
				615	else) and also track the mapping of sections of the file into process
				616	address spaces.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	617
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	618	There are a number of distinct yet related services that an
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	619	address-space can provide. These include communicating memory pressure,
				620	page lookup by address, and keeping track of pages tagged as Dirty or
				621	Writeback.
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	622
NeilBrown	a9e102b	2006-03-25 03:08:29 -0800	[diff] [blame]	623	The first can be used independently of the others. The VM can try to
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	624	either write dirty pages in order to clean them, or release clean pages
				625	in order to reuse them. To do this it can call the ->writepage method
				626	on dirty pages, and ->releasepage on clean pages with PagePrivate set.
				627	Clean pages without PagePrivate and with no external references will be
				628	released without notice being given to the address_space.
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	629
NeilBrown	a9e102b	2006-03-25 03:08:29 -0800	[diff] [blame]	630	To achieve this functionality, pages need to be placed on an LRU with
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	631	lru_cache_add and mark_page_accessed needs to be called whenever the page
				632	is used.
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	633
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	634	Pages are normally kept in a radix tree indexed by ->index. This tree
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	635	maintains information about the PG_Dirty and PG_Writeback status of each
				636	page, so that pages with either of these flags can be found quickly.
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	637
				638	The Dirty tag is primarily used by mpage_writepages - the default
				639	->writepages method. It uses the tag to find dirty pages to call
				640	->writepage on. If mpage_writepages is not used (i.e. the address_space
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	641	provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is almost
				642	unused. write_inode_now and sync_inode do use it (through
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	643	__sync_single_inode) to check if ->writepages has been successful in
				644	writing out the whole address_space.
				645
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	646	The Writeback tag is used by filemap*wait* and sync_page* functions, via
				647	filemap_fdatawait_range, to wait for all writeback to complete.
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	648
				649	An address_space handler may attach extra information to a page,
				650	typically using the 'private' field in the 'struct page'. If such
				651	information is attached, the PG_Private flag should be set. This will
NeilBrown	a9e102b	2006-03-25 03:08:29 -0800	[diff] [blame]	652	cause various VM routines to make extra calls into the address_space
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	653	handler to deal with that data.
				654
				655	An address space acts as an intermediary between storage and
				656	application. Data is read into the address space a whole page at a
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	657	time, and provided to the application either by copying of the page, or
				658	by memory-mapping the page. Data is written into the address space by
				659	the application, and then written back to storage, typically in whole
				660	pages; however, the address_space has finer control of write sizes.
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	661
				662	The read process essentially only requires 'readpage'. The write
Nick Piggin	4e02ed4	2008-10-29 14:00:55 -0700	[diff] [blame]	663	process is more complicated and uses write_begin/write_end or
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	664	set_page_dirty to write data into the address_space, and writepage and
				665	writepages to writeback data to storage.
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	666
				667	Adding and removing pages to/from an address_space is protected by the
				668	inode's i_mutex.
				669
				670	When data is written to a page, the PG_Dirty flag should be set. It
				671	typically remains set until writepage asks for it to be written. This
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	672	should clear PG_Dirty and set PG_Writeback. It can be actually written
				673	at any point after PG_Dirty is clear. Once it is known to be safe,
				674	PG_Writeback is cleared.
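
The flag transitions in the paragraph above amount to a small state machine. The following userspace sketch models it with two bits; the ``EX_`` names and helper functions are hypothetical and do not correspond to kernel helpers.

```c
#include <assert.h>

enum {
	EX_PG_DIRTY     = 1 << 0,
	EX_PG_WRITEBACK = 1 << 1,
};

struct ex_page {
	unsigned int flags;
};

/* Data was written into the page: mark it dirty. */
static void ex_page_dirtied(struct ex_page *p)
{
	p->flags |= EX_PG_DIRTY;
}

/* writepage was asked to write it: clear Dirty, set Writeback. */
static void ex_page_start_writeback(struct ex_page *p)
{
	p->flags &= ~EX_PG_DIRTY;
	p->flags |= EX_PG_WRITEBACK;
}

/* The data is known to be safe on storage: clear Writeback. */
static void ex_page_end_writeback(struct ex_page *p)
{
	p->flags &= ~EX_PG_WRITEBACK;
}

/* Usage: walk one page through the full cycle. */
static void ex_page_demo(void)
{
	struct ex_page p = { 0 };

	ex_page_dirtied(&p);
	assert(p.flags == EX_PG_DIRTY);
	ex_page_start_writeback(&p);
	assert(p.flags == EX_PG_WRITEBACK);
	ex_page_end_writeback(&p);
	assert(p.flags == 0);
}
```

A page dirtied again while under writeback would carry both bits at once in this model, which is why the two flags are tracked separately.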
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	675
Jeff Layton	acbf3c3	2017-07-06 07:02:27 -0400	[diff] [blame]	676	Writeback makes use of a writeback_control structure to direct the
Randy Dunlap	8286de7	2020-07-03 14:43:25 -0700	[diff] [blame]	677	operations. This gives the writepage and writepages operations some
Jeff Layton	acbf3c3	2017-07-06 07:02:27 -0400	[diff] [blame]	678	information about the nature of and reason for the writeback request,
				679	and the constraints under which it is being done. It is also used to
				680	return information back to the caller about the result of a writepage or
				681	writepages request.
				682
Tobin C. Harding	e04c83c	2019-05-15 10:29:08 +1000	[diff] [blame]	683
Jeff Layton	acbf3c3	2017-07-06 07:02:27 -0400	[diff] [blame]	684	Handling errors during writeback
				685	--------------------------------
Tobin C. Harding	e04c83c	2019-05-15 10:29:08 +1000	[diff] [blame]	686
Jeff Layton	acbf3c3	2017-07-06 07:02:27 -0400	[diff] [blame]	687	Most applications that do buffered I/O will periodically call a file
				688	synchronization call (fsync, fdatasync, msync or sync_file_range) to
				689	ensure that data written has made it to the backing store. When there
				690	is an error during writeback, they expect that error to be reported when
				691	a file sync request is made. After an error has been reported on one
				692	request, subsequent requests on the same file descriptor should return
				693	0, unless further writeback errors have occurred since the previous file
				694	synchronization.
				695
				696	Ideally, the kernel would report errors only on file descriptions on
				697	which writes were done that subsequently failed to be written back. The
				698	generic pagecache infrastructure does not track the file descriptions
				699	that have dirtied each individual page however, so determining which
				700	file descriptors should get back an error is not possible.
				701
				702	Instead, the generic writeback error tracking infrastructure in the
				703	kernel settles for reporting errors to fsync on all file descriptions
				704	that were open at the time that the error occurred. In a situation with
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	705	multiple writers, all of them will get back an error on a subsequent
				706	fsync, even if all of the writes done through that particular file
				707	descriptor succeeded (or even if there were no writes on that file
				708	descriptor at all).
Jeff Layton	acbf3c3	2017-07-06 07:02:27 -0400	[diff] [blame]	709
				710	Filesystems that wish to use this infrastructure should call
				711	mapping_set_error to record the error in the address_space when it
				712	occurs. Then, after writing back data from the pagecache in their
				713	file->fsync operation, they should call file_check_and_advance_wb_err to
				714	ensure that the struct file's error cursor has advanced to the correct
				715	point in the stream of errors emitted by the backing device(s).
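
The reporting behaviour described in this section can be modelled with a sequence counter on the mapping and a cursor in each open file. This is a deliberately simplified userspace sketch, assuming a plain counter rather than the kernel's actual error-tracking type; the ``ex_`` functions are hypothetical analogues of mapping_set_error and file_check_and_advance_wb_err, not their implementations.

```c
#include <assert.h>
#include <errno.h>

struct ex_mapping {
	unsigned long err_seq;	/* bumped on every recorded writeback error */
	int err;		/* most recent error code */
};

struct ex_file {
	struct ex_mapping *mapping;
	unsigned long seen_seq;	/* this file's cursor into the error stream */
};

/* Opening a file samples the current position in the error stream. */
static void ex_open(struct ex_file *f, struct ex_mapping *m)
{
	f->mapping = m;
	f->seen_seq = m->err_seq;
}

/* Record a writeback error on the address space. */
static void ex_mapping_set_error(struct ex_mapping *m, int err)
{
	m->err = err;
	m->err_seq++;
}

/* Called from the fsync path: report once, then advance the cursor. */
static int ex_check_and_advance(struct ex_file *f)
{
	if (f->seen_seq == f->mapping->err_seq)
		return 0;
	f->seen_seq = f->mapping->err_seq;
	return f->mapping->err;
}

/* Usage: both open files see the error exactly once. */
static void ex_wb_err_demo(void)
{
	struct ex_mapping m = { 0, 0 };
	struct ex_file a, b;

	ex_open(&a, &m);
	ex_open(&b, &m);
	ex_mapping_set_error(&m, -EIO);
	assert(ex_check_and_advance(&a) == -EIO);
	assert(ex_check_and_advance(&a) == 0);
	assert(ex_check_and_advance(&b) == -EIO);
}
```

Note that ``b`` gets the error even though it never wrote anything, which is the trade-off the text describes.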
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	716
Tobin C. Harding	e04c83c	2019-05-15 10:29:08 +1000	[diff] [blame]	717
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	718	struct address_space_operations
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	719	-------------------------------
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	720
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	721	This describes how the VFS can manipulate mapping of a file to page
				722	cache in your filesystem. The following members are defined:
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	723
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	724	.. code-block:: c
				725
				726	struct address_space_operations {
				727	int (*writepage)(struct page *page, struct writeback_control *wbc);
				728	int (*readpage)(struct file *, struct page *);
				729	int (*writepages)(struct address_space *, struct writeback_control *);
				730	int (*set_page_dirty)(struct page *page);
Matthew Wilcox (Oracle)	8151b4c	2020-06-01 21:46:44 -0700	[diff] [blame]	731	void (*readahead)(struct readahead_control *);
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	732	int (*readpages)(struct file *filp, struct address_space *mapping,
				733	struct list_head *pages, unsigned nr_pages);
				734	int (*write_begin)(struct file *, struct address_space *mapping,
				735	loff_t pos, unsigned len, unsigned flags,
Nick Piggin	afddba4	2007-10-16 01:25:01 -0700	[diff] [blame]	736	struct page **pagep, void **fsdata);
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	737	int (*write_end)(struct file *, struct address_space *mapping,
				738	loff_t pos, unsigned len, unsigned copied,
				739	struct page *page, void *fsdata);
				740	sector_t (*bmap)(struct address_space *, sector_t);
				741	void (*invalidatepage) (struct page *, unsigned int, unsigned int);
				742	int (*releasepage) (struct page *, int);
				743	void (*freepage)(struct page *);
				744	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
				745	/* isolate a page for migration */
				746	bool (*isolate_page) (struct page *, isolate_mode_t);
				747	/* migrate the contents of a page to the specified target */
				748	int (*migratepage) (struct page *, struct page *);
				749	/* put migration-failed page back to right list */
				750	void (*putback_page) (struct page *);
				751	int (*launder_page) (struct page *);
Minchan Kim	bda807d	2016-07-26 15:23:05 -0700	[diff] [blame]	752
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	753	int (*is_partially_uptodate) (struct page *, unsigned long,
				754	unsigned long);
				755	void (*is_dirty_writeback) (struct page *, bool *, bool *);
				756	int (*error_remove_page) (struct mapping *mapping, struct page *page);
				757	int (*swap_activate)(struct file *);
				758	int (*swap_deactivate)(struct file *);
				759	};
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	760
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	761	``writepage``
				762	called by the VM to write a dirty page to backing store. This
				763	may happen for data integrity reasons (i.e. 'sync'), or to free
				764	up memory (flush). The difference can be seen in
				765	wbc->sync_mode. The PG_Dirty flag has been cleared and
				766	PageLocked is true. writepage should start writeout, should set
				767	PG_Writeback, and should make sure the page is unlocked, either
				768	synchronously or asynchronously when the write operation
				769	completes.
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	770
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	771	If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
				772	try too hard if there are problems, and may choose to write out
				773	other pages from the mapping if that is easier (e.g. due to
				774	internal dependencies). If it chooses not to start writeout, it
				775	should return AOP_WRITEPAGE_ACTIVATE so that the VM will not
				776	keep calling ->writepage on that page.
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	777
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	778	See Documentation/filesystems/locking.rst for more details.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	779
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	780	``readpage``
				781	called by the VM to read a page from backing store. The page
				782	will be Locked when readpage is called, and should be unlocked
				783	and marked uptodate once the read completes. If ->readpage
				784	discovers that it needs to unlock the page for some reason, it
				785	can do so, and then return AOP_TRUNCATED_PAGE. In this case,
				786	the page will be relocated, relocked and if that all succeeds,
				787	->readpage will be called again.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	788
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	789	``writepages``
				790	called by the VM to write out pages associated with the
Julia Lawall	e9b2f15	2020-07-26 21:22:21 +0200	[diff] [blame]	791	address_space object. If wbc->sync_mode is WB_SYNC_ALL, then
Tobin C. Harding	50c1f43	2019-05-15 10:29:05 +1000	[diff] [blame]	792	the writeback_control will specify a range of pages that must be
Julia Lawall	e9b2f15	2020-07-26 21:22:21 +0200	[diff] [blame]	793	written out. If it is WB_SYNC_NONE, then a nr_to_write is
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	794	given and that many pages should be written if possible. If no
				795	->writepages is given, then mpage_writepages is used instead.
				796	This will choose pages from the address space that are tagged as
				797	DIRTY and will pass them to ->writepage.
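
The difference between the two sync modes can be shown with a toy model of ->writepages over a fixed array of pages. It is a sketch under simplifying assumptions: the range is expressed in page indices rather than byte offsets, "writing" just clears a dirty bit, and every ``ex_`` or ``EX_`` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EX_NPAGES 8

enum ex_sync_mode { EX_WB_SYNC_NONE, EX_WB_SYNC_ALL };

struct ex_wbc {
	enum ex_sync_mode sync_mode;
	long nr_to_write;	/* budget, honoured for EX_WB_SYNC_NONE */
	size_t range_start;	/* first page index, used for EX_WB_SYNC_ALL */
	size_t range_end;	/* last page index (inclusive) */
};

/* Returns the number of pages "written" (dirty bit cleared). */
static int ex_writepages(bool dirty[EX_NPAGES], struct ex_wbc *wbc)
{
	size_t first = 0, last = EX_NPAGES - 1;
	int written = 0;

	if (wbc->sync_mode == EX_WB_SYNC_ALL) {
		/* integrity writeback: everything in the range must go out */
		first = wbc->range_start;
		if (wbc->range_end < last)
			last = wbc->range_end;
	}
	for (size_t i = first; i <= last; i++) {
		if (!dirty[i])
			continue;
		/* best-effort writeback stops when the budget is used up */
		if (wbc->sync_mode == EX_WB_SYNC_NONE && wbc->nr_to_write <= 0)
			break;
		dirty[i] = false;
		wbc->nr_to_write--;
		written++;
	}
	return written;
}

/* Usage: a budgeted pass, then a ranged integrity pass. */
static void ex_writepages_demo(void)
{
	bool dirty[EX_NPAGES] = { true, true, true, true, true, true, true, true };
	struct ex_wbc none = { EX_WB_SYNC_NONE, 3, 0, 0 };
	struct ex_wbc all = { EX_WB_SYNC_ALL, 0, 4, 6 };

	assert(ex_writepages(dirty, &none) == 3);	/* pages 0..2 */
	assert(ex_writepages(dirty, &all) == 3);	/* pages 4..6 */
	assert(dirty[3] && dirty[7]);
}
```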
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	798
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	799	``set_page_dirty``
				800	called by the VM to set a page dirty. This is particularly
				801	needed if an address space attaches private data to a page, and
				802	that data needs to be updated when a page is dirtied. This is
				803	called, for example, when a memory mapped page gets modified.
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	804	If defined, it should set the PageDirty flag, and the
Tobin C. Harding	1b44ae6	2019-05-15 10:29:12 +1000	[diff] [blame]	805	PAGECACHE_TAG_DIRTY tag in the radix tree.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	806
Matthew Wilcox (Oracle)	8151b4c	2020-06-01 21:46:44 -0700	[diff] [blame]	807	``readahead``
				808	Called by the VM to read pages associated with the address_space
				809	object. The pages are consecutive in the page cache and are
				810	locked. The implementation should decrement the page refcount
				811	after starting I/O on each page. Usually the page will be
				812	unlocked by the I/O completion handler. If the filesystem decides
				813	to stop attempting I/O before reaching the end of the readahead
				814	window, it can simply return. The caller will decrement the page
				815	refcount and unlock the remaining pages for you. Set PageUptodate
				816	if the I/O completes successfully. Setting PageError on any page
				817	will be ignored; simply unlock the page if an I/O error occurs.
				818
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	819	``readpages``
				820	called by the VM to read pages associated with the address_space
				821	object. This is essentially just a vector version of readpage.
				822	Instead of just one page, several pages are requested.
NeilBrown	a9e102b	2006-03-25 03:08:29 -0800	[diff] [blame]	823	readpages is only used for read-ahead, so read errors are
Tobin C. Harding	50c1f43	2019-05-15 10:29:05 +1000	[diff] [blame]	824	ignored. If anything goes wrong, feel free to give up.
Matthew Wilcox (Oracle)	8151b4c	2020-06-01 21:46:44 -0700	[diff] [blame]	825	This interface is deprecated and will be removed by the end of
				826	2020; implement readahead instead.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	827
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	828	``write_begin``
				829	Called by the generic buffered write code to ask the filesystem
				830	to prepare to write len bytes at the given offset in the file.
				831	The address_space should check that the write will be able to
				832	complete, by allocating space if necessary and doing any other
				833	internal housekeeping. If the write will update parts of any
				834	basic-blocks on storage, then those blocks should be pre-read
				835	(if they haven't been read already) so that the updated blocks
				836	can be written out properly.
Nick Piggin	afddba4	2007-10-16 01:25:01 -0700	[diff] [blame]	837
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	838	The filesystem must return the locked pagecache page for the
				839	specified offset, in ``*pagep``, for the caller to write into.
Nick Piggin	afddba4	2007-10-16 01:25:01 -0700	[diff] [blame]	840
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	841	It must be able to cope with short writes (where the length
				842	passed to write_begin is greater than the number of bytes copied
				843	into the page).
Nick Piggin	4e02ed4	2008-10-29 14:00:55 -0700	[diff] [blame]	844
Nick Piggin	afddba4	2007-10-16 01:25:01 -0700	[diff] [blame]	845	flags is a field for AOP_FLAG_xxx flags, described in
				846	include/linux/fs.h.
				847
Tobin C. Harding	1b44ae6	2019-05-15 10:29:12 +1000	[diff] [blame]	848	A void * may be returned in fsdata, which then gets passed into
				849	write_end.
Nick Piggin	afddba4	2007-10-16 01:25:01 -0700	[diff] [blame]	850
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	851	Returns 0 on success; < 0 on failure (which is the error code),
				852	in which case write_end is not called.
Nick Piggin	afddba4	2007-10-16 01:25:01 -0700	[diff] [blame]	853
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	854	``write_end``
				855	After a successful write_begin, and data copy, write_end must be
				856	called. len is the original len passed to write_begin, and
				857	copied is the amount that was able to be copied.
Nick Piggin	afddba4	2007-10-16 01:25:01 -0700	[diff] [blame]	858
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	859	The filesystem must take care of unlocking the page and
				860	releasing its refcount, and updating i_size.
Nick Piggin	afddba4	2007-10-16 01:25:01 -0700	[diff] [blame]	861
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	862	Returns < 0 on failure, otherwise the number of bytes (<=
				863	'copied') that were able to be copied into pagecache.
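
The write_begin/write_end contract, including the short-write case, can be sketched against a tiny in-memory "file". This is a userspace illustration only: the fixed-size buffer, ``struct ex_wfile`` and both ``ex_`` functions are hypothetical, and page locking and reference counting are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define EX_CAPACITY 64

struct ex_wfile {
	char data[EX_CAPACITY];
	size_t i_size;
};

/* Check that a write of len bytes at pos could complete. */
static int ex_write_begin(struct ex_wfile *f, size_t pos, size_t len)
{
	(void)f;
	if (pos > EX_CAPACITY || len > EX_CAPACITY - pos)
		return -ENOSPC;	/* on failure, write_end is not called */
	return 0;
}

/* Commit what was actually copied, which may be less than len. */
static int ex_write_end(struct ex_wfile *f, size_t pos, size_t len,
			size_t copied)
{
	if (copied > len)
		return -EINVAL;
	if (pos + copied > f->i_size)
		f->i_size = pos + copied;	/* extend the file if needed */
	return (int)copied;
}

/* Usage: a 10-byte request of which only 3 bytes get copied. */
static void ex_short_write_demo(void)
{
	struct ex_wfile f = { { 0 }, 0 };
	const char src[] = "abc";

	assert(ex_write_begin(&f, 0, 10) == 0);
	memcpy(f.data, src, 3);			/* short copy */
	assert(ex_write_end(&f, 0, 10, 3) == 3);
	assert(f.i_size == 3);
	assert(ex_write_begin(&f, 60, 10) == -ENOSPC);
}
```

The caller in this model would retry the remaining 7 bytes with a fresh write_begin/write_end pair starting at offset 3.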
Nick Piggin	afddba4	2007-10-16 01:25:01 -0700	[diff] [blame]	864
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	865	``bmap``
				866	called by the VFS to map a logical block offset within an object to
				867	a physical block number. This method is used by the FIBMAP ioctl
				868	and for working with swap-files. To be able to swap to a file,
				869	the file must have a stable mapping to a block device. The swap
				870	system does not go through the filesystem but instead uses bmap
				871	to find out where the blocks in the file are and uses those
				872	addresses directly.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	873
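As a rough illustration of the logical-to-physical mapping that bmap provides, the following userspace sketch looks a block up in an extent table. The ``toy_extent`` layout is hypothetical, and returning 0 for an unmapped block (a hole) is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical extent: 'count' logical blocks starting at 'logical'
 * are stored contiguously from physical block 'physical'.
 */
struct toy_extent {
	unsigned long logical;
	unsigned long physical;
	unsigned long count;
};

/* Returns the physical block for 'block', or 0 if it falls in a hole. */
static unsigned long toy_bmap(const struct toy_extent *map, size_t n,
			      unsigned long block)
{
	size_t i;

	assert(map != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (block >= map[i].logical &&
		    block - map[i].logical < map[i].count)
			return map[i].physical + (block - map[i].logical);
	}
	return 0;
}
```

A swap-style consumer would perform this lookup once per block and then address the device directly, which is why the mapping has to stay stable.
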
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	874	``invalidatepage``
				875	If a page has PagePrivate set, then invalidatepage will be
				876	called when part or all of the page is to be removed from the
				877	address space. This generally corresponds to either a
				878	truncation, punch hole or a complete invalidation of the address
Lukas Czerner	d47992f	2013-05-21 23:17:23 -0400	[diff] [blame]	879	space (in the latter case 'offset' will always be 0 and 'length'
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	880	will be PAGE_SIZE). Any private data associated with the page
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	881	should be updated to reflect this truncation. If offset is 0
				882	and length is PAGE_SIZE, then the private data should be
				883	released, because the page must be able to be completely
				884	discarded. This may be done by calling the ->releasepage
				885	function, but in this case the release MUST succeed.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	886
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	887	``releasepage``
				888	releasepage is called on PagePrivate pages to indicate that the
				889	page should be freed if possible. ->releasepage should remove
				890	any private data from the page and clear the PagePrivate flag.
				891	If releasepage() fails for some reason, it must indicate failure
				892	with a 0 return value. releasepage() is used in two distinct
				893	though related cases. The first is when the VM finds a clean
				894	page with no active users and wants to make it a free page. If
				895	->releasepage succeeds, the page will be removed from the
				896	address_space and become free.
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	897
Shaun Zinck	bc5b1d5	2007-10-20 02:35:36 +0200	[diff] [blame]	898	The second case is when a request has been made to invalidate
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	899	some or all pages in an address_space. This can happen through
				900	the fadvise(POSIX_FADV_DONTNEED) system call or by the
				901	filesystem explicitly requesting it as nfs and 9fs do (when they
				902	believe the cache may be out of date with storage) by calling
				903	invalidate_inode_pages2(). If the filesystem makes such a call,
				904	and needs to be certain that all pages are invalidated, then its
				905	releasepage will need to ensure this. Possibly it can clear the
				906	PageUptodate bit if it cannot free private data yet.
NeilBrown	341546f	2006-03-25 03:07:56 -0800	[diff] [blame]	907
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	908	``freepage``
				909	freepage is called once the page is no longer visible in the
				910	page cache in order to allow the cleanup of any private data.
				911	Since it may be called by the memory reclaimer, it should not
				912	assume that the original address_space mapping still exists, and
				913	it should not block.
Linus Torvalds	6072d13	2010-12-01 13:35:19 -0500	[diff] [blame]	914
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	915	``direct_IO``
				916	called by the generic read/write routines to perform direct_IO -
				917	that is IO requests which bypass the page cache and transfer
				918	data directly between the storage and the application's address
				919	space.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	920
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	921	``isolate_page``
				922	Called by the VM when isolating a movable non-lru page. If the page
				923	is successfully isolated, the VM marks the page as PG_isolated via
				924	__SetPageIsolated.
Minchan Kim	bda807d	2016-07-26 15:23:05 -0700	[diff] [blame]	925
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	926	``migrate_page``
				927	This is used to compact the physical memory usage. If the VM
				928	wants to relocate a page (maybe off a memory card that is
				929	signalling imminent failure) it will pass a new page and an old
				930	page to this function. migrate_page should transfer any private
				931	data across and update any references that it has to the page.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	932
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	933	``putback_page``
				934	Called by the VM when an isolated page's migration fails.
Minchan Kim	bda807d	2016-07-26 15:23:05 -0700	[diff] [blame]	935
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	936	``launder_page``
				937	Called before freeing a page - it writes back the dirty page.
				938	To prevent redirtying the page, it is kept locked during the
				939	whole operation.
Borislav Petkov	422b14c	2007-07-15 23:41:43 -0700	[diff] [blame]	940
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	941	``is_partially_uptodate``
				942	Called by the VM when reading a file through the pagecache when
				943	the underlying blocksize != pagesize. If the required block is
				944	up to date then the read can complete without needing the IO to
				945	bring the whole page up to date.
Mel Gorman	26c0c5b	2013-07-03 15:04:45 -0700	[diff] [blame]	946
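The idea behind is_partially_uptodate can be sketched in userspace with a per-block bitmap. The 1024-byte block size, the bitmap representation and the ``toy_*`` names are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE  4096u
#define TOY_BLOCK_SIZE 1024u	/* assumed block size, smaller than a page */

/* Bit i of 'uptodate' is set when block i of the page holds valid data. */
static bool toy_is_partially_uptodate(unsigned int uptodate,
				      unsigned int from, unsigned int count)
{
	unsigned int first, last, i;

	assert(from < TOY_PAGE_SIZE && count <= TOY_PAGE_SIZE - from);
	if (count == 0)
		return false;
	first = from / TOY_BLOCK_SIZE;
	last = (from + count - 1) / TOY_BLOCK_SIZE;
	for (i = first; i <= last; i++) {
		if (!(uptodate & (1u << i)))
			return false;	/* a needed block is not valid yet */
	}
	return true;	/* the read can be served without further IO */
}
```

A read that touches only valid blocks can complete immediately, while one that straddles an invalid block still needs IO for the page.
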
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	947	``is_dirty_writeback``
				948	Called by the VM when attempting to reclaim a page. The VM uses
				949	dirty and writeback information to determine if it needs to
				950	stall to allow flushers a chance to complete some IO.
				951	Ordinarily it can use PageDirty and PageWriteback but some
				952	filesystems have more complex state (unstable pages in NFS
				953	prevent reclaim) or do not set those flags due to locking
				954	problems. This callback allows a filesystem to indicate to the
				955	VM if a page should be treated as dirty or writeback for the
				956	purposes of stalling.
Mel Gorman	543cc11	2013-07-03 15:04:46 -0700	[diff] [blame]	957
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	958	``error_remove_page``
				959	normally set to generic_error_remove_page if truncation is ok
				960	for this address space. Used for memory failure handling.
Andi Kleen	2571873	2009-09-16 11:50:13 +0200	[diff] [blame]	961	Setting this implies you deal with pages going away under you,
				962	unless you have them locked or reference counts increased.
				963
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	964	``swap_activate``
				965	Called when swapon is used on a file to allocate space if
				966	necessary and pin the block lookup information in memory. A
				967	return value of zero indicates success, in which case this file
				968	can be used to back swapspace.
Mel Gorman	62c230b	2012-07-31 16:44:55 -0700	[diff] [blame]	969
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	970	``swap_deactivate``
				971	Called during swapoff on files where swap_activate was
				972	successful.
Mel Gorman	62c230b	2012-07-31 16:44:55 -0700	[diff] [blame]	973
Andi Kleen	2571873	2009-09-16 11:50:13 +0200	[diff] [blame]	974
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	975	The File Object
				976	===============
				977
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	978	A file object represents a file opened by a process. This is also known
Jeff Layton	acbf3c3	2017-07-06 07:02:27 -0400	[diff] [blame]	979	as an "open file description" in POSIX parlance.
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	980
				981
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	982	struct file_operations
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	983	----------------------
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	984
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	985	This describes how the VFS can manipulate an open file. As of kernel
Amir Goldstein	17ef445	2018-08-27 15:56:01 +0300	[diff] [blame]	986	4.18, the following members are defined:
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	987
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	988	.. code-block:: c
				989
				990	struct file_operations {
				991	    struct module *owner;
				992	    loff_t (*llseek) (struct file *, loff_t, int);
				993	    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
				994	    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
				995	    ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
				996	    ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
				997	    int (*iopoll)(struct kiocb *kiocb, bool spin);
				998	    int (*iterate) (struct file *, struct dir_context *);
				999	    int (*iterate_shared) (struct file *, struct dir_context *);
				1000	    __poll_t (*poll) (struct file *, struct poll_table_struct *);
				1001	    long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
				1002	    long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
				1003	    int (*mmap) (struct file *, struct vm_area_struct *);
				1004	    int (*open) (struct inode *, struct file *);
				1005	    int (*flush) (struct file *, fl_owner_t id);
				1006	    int (*release) (struct inode *, struct file *);
				1007	    int (*fsync) (struct file *, loff_t, loff_t, int datasync);
				1008	    int (*fasync) (int, struct file *, int);
				1009	    int (*lock) (struct file *, int, struct file_lock *);
				1010	    ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
				1011	    unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
				1012	    int (*check_flags)(int);
				1013	    int (*flock) (struct file *, int, struct file_lock *);
				1014	    ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
				1015	    ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
				1016	    int (*setlease)(struct file *, long, struct file_lock **, void **);
				1017	    long (*fallocate)(struct file *file, int mode, loff_t offset,
				1018	                      loff_t len);
				1019	    void (*show_fdinfo)(struct seq_file *m, struct file *f);
				1020	#ifndef CONFIG_MMU
				1021	    unsigned (*mmap_capabilities)(struct file *);
				1022	#endif
				1023	    ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
				1024	    loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
				1025	                               struct file *file_out, loff_t pos_out,
				1026	                               loff_t len, unsigned int remap_flags);
				1027	    int (*fadvise)(struct file *, loff_t, loff_t, int);
				1028	};
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1029
				1030	Again, all methods are called without any locks being held, unless
				1031	otherwise noted.
				1032
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1033	``llseek``
				1034	called when the VFS needs to move the file position index
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1035
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1036	``read``
				1037	called by read(2) and related system calls
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1038
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1039	``read_iter``
				1040	possibly asynchronous read with iov_iter as destination
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	1041
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1042	``write``
				1043	called by write(2) and related system calls
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1044
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1045	``write_iter``
				1046	possibly asynchronous write with iov_iter as source
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	1047
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1048	``iopoll``
				1049	called when aio wants to poll for completions on HIPRI iocbs
Christoph Hellwig	fb7e160	2018-11-22 16:37:38 +0100	[diff] [blame]	1050
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1051	``iterate``
				1052	called when the VFS needs to read the directory contents
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1053
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1054	``iterate_shared``
				1055	called when the VFS needs to read the directory contents when
				1056	filesystem supports concurrent dir iterators
Amir Goldstein	17ef445	2018-08-27 15:56:01 +0300	[diff] [blame]	1057
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1058	``poll``
				1059	called by the VFS when a process wants to check if there is
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1060	activity on this file and (optionally) go to sleep until there
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	1061	is activity. Called by the select(2) and poll(2) system calls
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1062
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1063	``unlocked_ioctl``
				1064	called by the ioctl(2) system call.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	1065
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1066	``compat_ioctl``
				1067	called by the ioctl(2) system call when 32 bit system calls are
				1068	used on 64 bit kernels.
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	1069
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1070	``mmap``
				1071	called by the mmap(2) system call
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1072
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1073	``open``
				1074	called by the VFS when an inode should be opened. When the VFS
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	1075	opens a file, it creates a new "struct file". It then calls the
				1076	open method for the newly allocated file structure. You might
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1077	think that the open method really belongs in "struct
				1078	inode_operations", and you may be right. I think it's done the
				1079	way it is because it makes filesystems simpler to implement.
				1080	The open() method is a good place to initialize the
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	1081	"private_data" member in the file structure if you want to point
				1082	to a device structure
				1083
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1084	``flush``
				1085	called by the close(2) system call to flush a file
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1086
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1087	``release``
				1088	called when the last reference to an open file is closed
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1089
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1090	``fsync``
				1091	called by the fsync(2) system call. Also see the section above
				1092	entitled "Handling errors during writeback".
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1093
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1094	``fasync``
				1095	called by the fcntl(2) system call when asynchronous
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1096	(non-blocking) mode is enabled for a file
				1097
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1098	``lock``
				1099	called by the fcntl(2) system call for F_GETLK, F_SETLK, and
				1100	F_SETLKW commands
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	1101
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1102	``get_unmapped_area``
				1103	called by the mmap(2) system call
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	1104
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1105	``check_flags``
				1106	called by the fcntl(2) system call for F_SETFL command
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	1107
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1108	``flock``
				1109	called by the flock(2) system call
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	1110
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1111	``splice_write``
				1112	called by the VFS to splice data from a pipe to a file. This
				1113	method is used by the splice(2) system call
Pekka J Enberg	d1195c5	2006-04-11 14:21:59 +0200	[diff] [blame]	1114
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1115	``splice_read``
				1116	called by the VFS to splice data from a file to a pipe. This
				1117	method is used by the splice(2) system call
Pekka J Enberg	d1195c5	2006-04-11 14:21:59 +0200	[diff] [blame]	1118
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1119	``setlease``
				1120	called by the VFS to set or release a file lock lease. setlease
				1121	implementations should call generic_setlease to record or remove
				1122	the lease in the inode after setting it.
Hugh Dickins	17cf28a	2012-05-29 15:06:41 -0700	[diff] [blame]	1123
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1124	``fallocate``
				1125	called by the VFS to preallocate blocks or punch a hole.
Hugh Dickins	17cf28a	2012-05-29 15:06:41 -0700	[diff] [blame]	1126
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1127	``copy_file_range``
				1128	called by the copy_file_range(2) system call.
Amir Goldstein	17ef445	2018-08-27 15:56:01 +0300	[diff] [blame]	1129
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1130	``remap_file_range``
				1131	called by the ioctl(2) system call for FICLONERANGE and FICLONE
				1132	and FIDEDUPERANGE commands to remap file ranges. An
				1133	implementation should remap len bytes at pos_in of the source
				1134	file into the dest file at pos_out. Implementations must handle
				1135	callers passing in len == 0; this means "remap to the end of the
				1136	source file". The return value should be the number of bytes
				1137	remapped, or the usual negative error code if errors occurred
				1138	before any bytes were remapped. The remap_flags parameter
				1139	accepts REMAP_FILE_* flags. If REMAP_FILE_DEDUP is set then the
				1140	implementation must only remap if the requested file ranges have
Julia Lawall	cb56eca	2020-07-26 20:43:40 +0200	[diff] [blame]	1141	identical contents. If REMAP_FILE_CAN_SHORTEN is set, the caller is
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1142	ok with the implementation shortening the request length to
				1143	satisfy alignment or EOF requirements (or any other reason).
Amir Goldstein	17ef445	2018-08-27 15:56:01 +0300	[diff] [blame]	1144
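The length rules described for remap_file_range can be sketched as a pure function. This is a userspace model under stated assumptions: ``TOY_REMAP_CAN_SHORTEN`` stands in for REMAP_FILE_CAN_SHORTEN, the 4096-byte alignment is invented for the example, and rejecting a request with ``-1`` when shortening is not allowed is this sketch's policy rather than a documented kernel rule.

```c
#include <assert.h>

#define TOY_REMAP_CAN_SHORTEN 0x1u	/* stand-in for REMAP_FILE_CAN_SHORTEN */
#define TOY_BLOCK_SIZE 4096LL		/* assumed alignment requirement */

/*
 * Returns the number of bytes this toy filesystem would remap, or -1
 * if the request cannot be satisfied as given.
 */
static long long toy_remap_len(long long src_size, long long pos_in,
			       long long len, unsigned int flags)
{
	long long avail, end;

	assert(src_size >= 0 && pos_in >= 0 && len >= 0);
	if (pos_in > src_size)
		return -1;
	avail = src_size - pos_in;
	if (len == 0)		/* "remap to the end of the source file" */
		len = avail;
	if (len > avail) {	/* request runs past EOF */
		if (!(flags & TOY_REMAP_CAN_SHORTEN))
			return -1;
		len = avail;
	}
	end = pos_in + len;
	if (end != src_size && end % TOY_BLOCK_SIZE != 0) {
		if (!(flags & TOY_REMAP_CAN_SHORTEN))
			return -1;
		end -= end % TOY_BLOCK_SIZE;	/* round the end down */
		len = end > pos_in ? end - pos_in : 0;
	}
	return len;
}
```

With a 10000-byte source, a request of 5000 bytes from offset 0 is shortened to 4096 when the flag is set and rejected otherwise, while ``len == 0`` always extends to the end of the source.
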
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1145	``fadvise``
				1146	possibly called by the fadvise64() system call.
Amir Goldstein	45cd0fa	2018-08-27 15:56:02 +0300	[diff] [blame]	1147
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1148	Note that the file operations are implemented by the specific
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	1149	filesystem in which the inode resides. When opening a device node
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1150	(character or block special) most filesystems will call special
				1151	support routines in the VFS which will locate the required device
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	1152	driver information. These support routines replace the filesystem file
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1153	operations with those for the device driver, and then proceed to call
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	1154	the new open() method for the file. This is how opening a device file
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1155	in the filesystem eventually ends up calling the device driver open()
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	1156	method.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1157
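The hand-over from filesystem file operations to device driver file operations can be modelled in userspace with plain function-pointer tables. All ``toy_*`` names are hypothetical, and the single-method table is a deliberate simplification of struct file_operations.

```c
#include <assert.h>
#include <stddef.h>

struct toy_file;

/* A cut-down operations table with a single method. */
struct toy_file_operations {
	int (*open)(struct toy_file *);
};

struct toy_file {
	const struct toy_file_operations *f_op;
	int is_device_node;
	int opened_by_driver;
};

/* The (hypothetical) device driver's open method. */
static int toy_driver_open(struct toy_file *file)
{
	file->opened_by_driver = 1;
	return 0;
}

static const struct toy_file_operations toy_driver_fops = {
	.open = toy_driver_open,
};

/*
 * Models the support routine: swap in the driver's table, then call
 * the new open method.
 */
static int toy_device_open(struct toy_file *file)
{
	file->f_op = &toy_driver_fops;
	return file->f_op->open ? file->f_op->open(file) : 0;
}

static const struct toy_file_operations toy_device_node_fops = {
	.open = toy_device_open,
};

/*
 * Models the VFS side of open(2): pick the table for the inode and
 * call its open method, treating a NULL method as "nothing to do".
 */
static int toy_vfs_open(struct toy_file *file,
			const struct toy_file_operations *fs_fops)
{
	assert(file != NULL && fs_fops != NULL);
	file->f_op = file->is_device_node ? &toy_device_node_fops : fs_fops;
	return file->f_op->open ? file->f_op->open(file) : 0;
}
```

After ``toy_vfs_open()`` returns for a device node, ``f_op`` points at the driver's table, so every later method call on that file reaches the driver directly.
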
				1158
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	1159	Directory Entry Cache (dcache)
				1160	==============================
				1161
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1162
				1163	struct dentry_operations
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	1164	------------------------
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1165
				1166	This describes how a filesystem can overload the standard dentry
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	1167	operations. Dentries and the dcache are the domain of the VFS and the
				1168	individual filesystem implementations. Device drivers have no business
				1169	here. These methods may be set to NULL, as they are either optional or
				1170	the VFS uses a default. As of kernel 2.6.22, the following members are
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1171	defined:
				1172
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	1173	.. code-block:: c
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1174
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	1175	struct dentry_operations {
				1176	    int (*d_revalidate)(struct dentry *, unsigned int);
				1177	    int (*d_weak_revalidate)(struct dentry *, unsigned int);
				1178	    int (*d_hash)(const struct dentry *, struct qstr *);
				1179	    int (*d_compare)(const struct dentry *,
				1180	                     unsigned int, const char *, const struct qstr *);
				1181	    int (*d_delete)(const struct dentry *);
				1182	    int (*d_init)(struct dentry *);
				1183	    void (*d_release)(struct dentry *);
				1184	    void (*d_iput)(struct dentry *, struct inode *);
				1185	    char *(*d_dname)(struct dentry *, char *, int);
				1186	    struct vfsmount *(*d_automount)(struct path *);
				1187	    int (*d_manage)(const struct path *, bool);
				1188	    struct dentry *(*d_real)(struct dentry *, const struct inode *);
				1189	};
				1190
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1191	``d_revalidate``
				1192	called when the VFS needs to revalidate a dentry. This is
				1193	called whenever a name look-up finds a dentry in the dcache.
				1194	Most local filesystems leave this as NULL, because all their
				1195	dentries in the dcache are valid. Network filesystems are
				1196	different since things can change on the server without the
				1197	client necessarily being aware of it.
Jeff Layton	ecf3d1f	2013-02-20 11:19:05 -0500	[diff] [blame]	1198
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1199	This function should return a positive value if the dentry is
				1200	still valid, and zero or a negative error code if it isn't.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1201
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1202	d_revalidate may be called in rcu-walk mode (flags &
				1203	LOOKUP_RCU). If in rcu-walk mode, the filesystem must
				1204	revalidate the dentry without blocking or storing to the dentry;
				1205	d_parent and d_inode should not be used without care (because
				1206	they can change and, in d_inode case, even become NULL under
				1207	us).
Nick Piggin	34286d6	2011-01-07 17:49:57 +1100	[diff] [blame]	1208
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1209	If a situation is encountered that rcu-walk cannot handle,
				1210	return
Nick Piggin	34286d6	2011-01-07 17:49:57 +1100	[diff] [blame]	1211	-ECHILD and it will be called again in ref-walk mode.
				1212
Glenn Washburn	29cb0f6	2023-02-27 12:40:42 -0600	[diff] [blame^]	1213	``d_weak_revalidate``
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1214	called when the VFS needs to revalidate a "jumped" dentry. This
				1215	is called when a path-walk ends at a dentry that was not acquired
				1216	by doing a lookup in the parent directory. This includes "/",
				1217	"." and "..", as well as procfs-style symlinks and mountpoint
				1218	traversal.
Jeff Layton	ecf3d1f	2013-02-20 11:19:05 -0500	[diff] [blame]	1219
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1220	In this case, we are less concerned with whether the dentry is
				1221	still fully correct, but rather that the inode is still valid.
				1222	As with d_revalidate, most local filesystems will set this to
				1223	NULL since their dcache entries are always valid.
Jeff Layton	ecf3d1f	2013-02-20 11:19:05 -0500	[diff] [blame]	1224
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1225	This function has the same return code semantics as
				1226	d_revalidate.
Jeff Layton	ecf3d1f	2013-02-20 11:19:05 -0500	[diff] [blame]	1227
				1228	d_weak_revalidate is only called after leaving rcu-walk mode.
				1229
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1230	``d_hash``
				1231	called when the VFS adds a dentry to the hash table. The first
Nick Piggin	621e155	2011-01-07 17:49:27 +1100	[diff] [blame]	1232	dentry passed to d_hash is the parent directory that the name is
Linus Torvalds	da53be1	2013-05-21 15:22:44 -0700	[diff] [blame]	1233	to be hashed into.
Nick Piggin	b1e6a01	2011-01-07 17:49:28 +1100	[diff] [blame]	1234
				1235	Same locking and synchronisation rules as d_compare regarding
				1236	what is safe to dereference etc.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1237
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1238	``d_compare``
				1239	called to compare a dentry name with a given name. The first
Nick Piggin	621e155	2011-01-07 17:49:27 +1100	[diff] [blame]	1240	dentry is the parent of the dentry to be compared, the second is
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1241	the child dentry. len and name string are properties of the
				1242	dentry to be compared. qstr is the name to compare it with.
Nick Piggin	621e155	2011-01-07 17:49:27 +1100	[diff] [blame]	1243
				1244	Must be constant and idempotent, and should not take locks if
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1245	possible, and should not store into the dentry. Should not
				1246	dereference pointers outside the dentry without lots of care
				1247	(e.g. d_parent, d_inode, d_name should not be used).
Nick Piggin	621e155	2011-01-07 17:49:27 +1100	[diff] [blame]	1248
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1249	However, our vfsmount is pinned, and RCU held, so the dentries
				1250	and inodes won't disappear, neither will our sb or filesystem
				1251	module. ->d_sb may be used.
Nick Piggin	621e155	2011-01-07 17:49:27 +1100	[diff] [blame]	1252
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1253	It is a tricky calling convention because it needs to be called
				1254	under "rcu-walk", i.e. without any locks or references on things.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1255
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1256	``d_delete``
				1257	called when the last reference to a dentry is dropped and the
				1258	dcache is deciding whether or not to cache it. Return 1 to
				1259	delete immediately, or 0 to cache the dentry. Default is NULL
				1260	which means to always cache a reachable dentry. d_delete must
				1261	be constant and idempotent.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1262
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1263	``d_init``
				1264	called when a dentry is allocated
Miklos Szeredi	285b102	2016-06-28 11:47:32 +0200	[diff] [blame]	1265
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1266	``d_release``
				1267	called when a dentry is really deallocated
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1268
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1269	``d_iput``
				1270	called when a dentry loses its inode (just prior to its being
				1271	deallocated). The default when this is NULL is that the VFS
				1272	calls iput(). If you define this method, you must call iput()
				1273	yourself.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1274
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1275	``d_dname``
				1276	called when the pathname of a dentry should be generated.
				1277	Useful for some pseudo filesystems (sockfs, pipefs, ...) to
				1278	delay pathname generation. (Instead of doing it when the dentry is
				1279	created, it's done only when the path is needed.) Real
				1280	filesystems probably don't want to use it, because their dentries
				1281	are present in global dcache hash, so their hash should be an
				1282	invariant. As no lock is held, d_dname() should not try to
				1283	modify the dentry itself, unless appropriate SMP safety is used.
				1284	CAUTION: d_path() logic is quite tricky. The correct way to
				1285	return for example "Hello" is to put it at the end of the
				1286	buffer and return a pointer to the first char. The
				1287	dynamic_dname() helper function is provided to take care of
				1288	this.
Eric Dumazet	c23fbb6	2007-05-08 00:26:18 -0700	[diff] [blame]	1289
Miklos Szeredi	0cac643	2016-06-30 08:53:28 +0200	[diff] [blame]	1290	Example:
				1291
Tobin C. Harding	af96c1e3	2019-05-15 10:29:13 +1000	[diff] [blame]	1292	.. code-block:: c
				1293
Miklos Szeredi	0cac643	2016-06-30 08:53:28 +0200	[diff] [blame]	1294	static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
				1295	{
				1296	    return dynamic_dname(dentry, buffer, buflen, "pipe:[%lu]",
				1297	                         dentry->d_inode->i_ino);
				1298	}
				1299
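The "put it at the end of the buffer" convention can be tried out in userspace. The helper below is a hypothetical sketch in the spirit of dynamic_dname(), not the kernel function itself: the name ``toy_name_at_end``, the 64-byte scratch buffer and the NULL return on overflow are assumptions of this example.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Formats a name into the END of buffer[0..buflen) and returns a
 * pointer to its first character, or NULL if it does not fit.
 */
static char *toy_name_at_end(char *buffer, int buflen, const char *fmt, ...)
{
	char temp[64];
	va_list args;
	int sz;

	assert(buffer != NULL && buflen >= 0);
	va_start(args, fmt);
	sz = vsnprintf(temp, sizeof(temp), fmt, args);
	va_end(args);
	if (sz < 0)
		return NULL;
	sz += 1;	/* count the terminating NUL */
	if (sz > (int)sizeof(temp) || sz > buflen)
		return NULL;
	/* Place the string so that its NUL is the last byte of the buffer. */
	return memcpy(buffer + buflen - sz, temp, (size_t)sz);
}
```

The returned pointer lies inside the buffer rather than at its start, so a caller can prepend further path components in front of it without moving the name.
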
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1300	``d_automount``
				1301	called when an automount dentry is to be traversed (optional).
				1302	This should create a new VFS mount record and return the record
				1303	to the caller. The caller is supplied with a path parameter
				1304	giving the automount directory to describe the automount target
				1305	and the parent VFS mount record to provide inheritable mount
				1306	parameters. NULL should be returned if someone else managed to
				1307	make the automount first. If the vfsmount creation failed, then
				1308	an error code should be returned. If -EISDIR is returned, then
				1309	the directory will be treated as an ordinary directory and
				1310	returned to pathwalk to continue walking.
David Howells	ea5b778	2011-01-14 19:10:03 +0000	[diff] [blame]	1311
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1312	If a vfsmount is returned, the caller will attempt to mount it
				1313	on the mountpoint and will remove the vfsmount from its
				1314	expiration list in the case of failure. The vfsmount should be
				1315	returned with 2 refs on it to prevent automatic expiration - the
				1316	caller will clean up the additional ref.
David Howells	9875cf8	2011-01-14 18:45:21 +0000	[diff] [blame]	1317
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1318	This function is only used if DCACHE_NEED_AUTOMOUNT is set on
				1319	the dentry. This is set by __d_instantiate() if S_AUTOMOUNT is
				1320	set on the inode being added.
David Howells	9875cf8	2011-01-14 18:45:21 +0000	[diff] [blame]	1321
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1322	``d_manage``
				1323	called to allow the filesystem to manage the transition from a
				1324	dentry (optional). This allows autofs, for example, to hold up
				1325	clients waiting to explore behind a 'mountpoint' while letting
				1326	the daemon go past and construct the subtree there. 0 should be
				1327	returned to let the calling process continue. -EISDIR can be
				1328	returned to tell pathwalk to use this directory as an ordinary
				1329	directory and to ignore anything mounted on it and not to check
				1330	the automount flag. Any other error code will abort pathwalk
				1331	completely.
David Howells	cc53ce5	2011-01-14 18:45:26 +0000	[diff] [blame]	1332
David Howells	ab90911	2011-01-14 18:46:51 +0000	[diff] [blame]	1333	If the 'rcu_walk' parameter is true, then the caller is doing a
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1334	pathwalk in RCU-walk mode. Sleeping is not permitted in this
				1335	mode, and the caller can be asked to leave it and call again by
				1336	returning -ECHILD. -EISDIR may also be returned to tell
				1337	pathwalk to ignore d_automount or any mounts.
David Howells	ab90911	2011-01-14 18:46:51 +0000	[diff] [blame]	1338
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1339	This function is only used if DCACHE_MANAGE_TRANSIT is set on
				1340	the dentry being transited from.
David Howells	cc53ce5	2011-01-14 18:45:26 +0000	[diff] [blame]	1341
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1342	``d_real``
				1343	overlay/union type filesystems implement this method to return
				1344	one of the underlying dentries hidden by the overlay. It is
				1345	used in two different modes:
Eric Dumazet	c23fbb6	2007-05-08 00:26:18 -0700	[diff] [blame]	1346
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1347	Called from file_dentry() it returns the real dentry matching
				1348	the inode argument. The real dentry may be from a lower layer
				1349	already copied up, but still referenced from the file. This
				1350	mode is selected with a non-NULL inode argument.
Miklos Szeredi	e698b8a	2016-06-30 08:53:27 +0200	[diff] [blame]	1351
Miklos Szeredi	fb16043	2018-07-18 15:44:44 +0200	[diff] [blame]	1352	With NULL inode the topmost real underlying dentry is returned.
Eric Dumazet	c23fbb6	2007-05-08 00:26:18 -0700	[diff] [blame]	1353
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1354	Each dentry has a pointer to its parent dentry, as well as a hash list
Tobin C. Harding	4ee33ea	2019-05-15 10:29:06 +1000	[diff] [blame]	1355	of child dentries. Child dentries are basically like files in a
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1356	directory.
				1357
Pekka J Enberg	5ea626a	2005-09-09 13:10:19 -0700	[diff] [blame]	1358
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	1359	Directory Entry Cache API
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1360	--------------------------
				1361
				1362	There are a number of functions defined which permit a filesystem to
				1363	manipulate dentries:
				1364
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1365	``dget``
				1366	open a new handle for an existing dentry (this just increments
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1367	the usage count)
				1368
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1369	``dput``
				1370	close a handle for a dentry (decrements the usage count). If
Nick Piggin	fe15ce4	2011-01-07 17:49:23 +1100	[diff] [blame]	1371	the usage count drops to 0, and the dentry is still in its
				1372	parent's hash, the "d_delete" method is called to check whether
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1373	it should be cached. If it should not be cached, or if the
				1374	dentry is not hashed, it is deleted. Otherwise cached dentries
				1375	are put into an LRU list to be reclaimed on memory shortage.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1376
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1377	``d_drop``
				1378	this unhashes a dentry from its parent's hash list. A subsequent
				1379	call to dput() will deallocate the dentry if its usage count
				1380	drops to 0.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1381
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1382	``d_delete``
				1383	delete a dentry. If there are no other open references to the
				1384	dentry then the dentry is turned into a negative dentry (the
				1385	d_iput() method is called). If there are other references, then
				1386	d_drop() is called instead.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1387
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1388	``d_add``
				1389	add a dentry to its parent's hash list and then call
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1390	d_instantiate().
				1391
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1392	``d_instantiate``
				1393	add a dentry to the alias hash list for the inode and update
				1394	the "d_inode" member. The "i_count" member in the inode
				1395	structure should be set/incremented. If the inode pointer is
				1396	NULL, the dentry is called a "negative dentry". This function
				1397	is commonly called when an inode is created for an existing
				1398	negative dentry.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1399
Tobin C. Harding	ee5dc04	2019-06-04 10:26:56 +1000	[diff] [blame]	1400	``d_lookup``
				1401	look up a dentry given its parent and path name component. It
				1402	looks up the child of that given name from the dcache hash
				1403	table. If it is found, the reference count is incremented and
				1404	the dentry is returned. The caller must use dput() to free the
				1405	dentry when it finishes using it.
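
The reference counting rules for dget(), dput() and d_drop() described
above can be illustrated with a small userspace toy model. This is only a
sketch: the ``toy_`` names and the three-state enum are hypothetical and
exist purely for illustration, the d_delete() method check is left out,
and none of this is the kernel's actual implementation.

```c
#include <stdbool.h>

/* Toy stand-in for what happens to a dentry; not a kernel type. */
enum toy_state { TOY_IN_USE, TOY_ON_LRU, TOY_FREED };

struct toy_dentry {
	int count;            /* usage count */
	bool hashed;          /* still in the parent's hash list? */
	enum toy_state state;
};

/* Like dget(): take another reference. */
static struct toy_dentry *toy_dget(struct toy_dentry *d)
{
	d->count++;
	d->state = TOY_IN_USE;
	return d;
}

/* Like d_drop(): unhash, but leave the usage count alone. */
static void toy_d_drop(struct toy_dentry *d)
{
	d->hashed = false;
}

/*
 * Like dput(): drop a reference. When the count reaches 0, a hashed
 * dentry is kept on the LRU list and an unhashed one is freed.
 */
static void toy_dput(struct toy_dentry *d)
{
	if (--d->count > 0)
		return;
	d->state = d->hashed ? TOY_ON_LRU : TOY_FREED;
}
```

In this model a dentry that was unhashed with toy_d_drop() is freed by
the final toy_dput(), whereas a hashed one is merely parked on the LRU
list, which mirrors the behaviour described for dput() and d_drop().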
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1406
Tobin C. Harding	e04c83c	2019-05-15 10:29:08 +1000	[diff] [blame]	1407
Miklos Szeredi	f84e3f5	2008-02-08 04:21:34 -0800	[diff] [blame]	1408	Mount Options
				1409	=============
				1410
Tobin C. Harding	e04c83c	2019-05-15 10:29:08 +1000	[diff] [blame]	1411
Miklos Szeredi	f84e3f5	2008-02-08 04:21:34 -0800	[diff] [blame]	1412	Parsing options
				1413	---------------
				1414
				1415	On mount and remount the filesystem is passed a string containing a
				1416	comma separated list of mount options. The options can have either of
				1417	these forms:
				1418
				1419	option
				1420	option=value
				1421
				1422	The <linux/parser.h> header defines an API that helps parse these
				1423	options. There are plenty of examples on how to use it in existing
				1424	filesystems.
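
To make the two option forms concrete, the following userspace sketch
splits a comma separated option string by hand. It deliberately does not
use the <linux/parser.h> API; the ``demo_`` names and the ``quiet`` and
``mode`` options are hypothetical and chosen only for illustration.

```c
#include <string.h>

/* Hypothetical option set; not taken from any real filesystem. */
struct demo_opts {
	int quiet;        /* set by the flag-style option "quiet" */
	char mode[16];    /* set by "mode=<value>" */
};

/*
 * Parse a comma separated option string in place (the string is modified).
 * Returns 0 on success, -1 on an unknown or malformed option.
 */
static int demo_parse_options(char *data, struct demo_opts *opts)
{
	char *p = data;

	while (p && *p) {
		char *next = strchr(p, ',');
		char *val;

		if (next)
			*next++ = '\0';   /* terminate the current token */

		val = strchr(p, '=');
		if (val)
			*val++ = '\0';    /* split "option=value" */

		if (*p == '\0' && !val) {
			/* empty token, e.g. "a,,b": ignore it */
		} else if (!strcmp(p, "quiet") && !val) {
			opts->quiet = 1;
		} else if (!strcmp(p, "mode") && val && *val &&
			   strlen(val) < sizeof(opts->mode)) {
			strcpy(opts->mode, val);
		} else {
			return -1;
		}
		p = next;
	}
	return 0;
}
```

A real filesystem would normally describe its options in a table and let
the parser helpers do the matching, rather than open-coding the string
comparisons as done here.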
				1425
Tobin C. Harding	e04c83c	2019-05-15 10:29:08 +1000	[diff] [blame]	1426
Miklos Szeredi	f84e3f5	2008-02-08 04:21:34 -0800	[diff] [blame]	1427	Showing options
				1428	---------------
				1429
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	1430	If a filesystem accepts mount options, it must define show_options() to
				1431	show all the currently active options. The rules are:
Miklos Szeredi	f84e3f5	2008-02-08 04:21:34 -0800	[diff] [blame]	1432
				1433	- options MUST be shown if they are not enabled by default or if their
				1434	values differ from the default
				1435
				1436	- options MAY be shown which are enabled by default or have their
				1437	default value
				1438
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	1439	Options used only internally between a mount helper and the kernel (such
				1440	as file descriptors), or which only have an effect during the mounting
				1441	(such as ones controlling the creation of a journal) are exempt from the
				1442	above rules.
Miklos Szeredi	f84e3f5	2008-02-08 04:21:34 -0800	[diff] [blame]	1443
Tobin C. Harding	90caa78	2019-05-15 10:29:07 +1000	[diff] [blame]	1444	The underlying reason for the above rules is to make sure that a mount
				1445	can be accurately replicated (e.g. umounting and mounting again) based
				1446	on the information found in /proc/mounts.
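
The MUST rule can be sketched in userspace as follows. In the kernel,
show_options() writes to a seq_file; here a plain buffer and snprintf()
stand in for it. The ``demo_`` names, the ``quiet`` and ``mode`` options
and the default mode of 0755 are assumptions made up for this example.

```c
#include <stdio.h>
#include <stddef.h>

#define DEMO_DEFAULT_MODE 0755u   /* hypothetical default value */

/* Hypothetical per-mount state; "quiet" is off by default. */
struct demo_mount {
	int quiet;
	unsigned int mode;
};

/*
 * Write the active non-default options into buf, each one prefixed
 * with a comma. Returns 0 on success, -1 if buf is too small.
 */
static int demo_show_options(const struct demo_mount *m, char *buf, size_t len)
{
	int n = 0;

	if (len == 0)
		return -1;
	buf[0] = '\0';

	/* A flag that is off by default is shown only when set. */
	if (m->quiet)
		n += snprintf(buf + n, len - n, ",quiet");

	/* A valued option is shown only when it differs from the default. */
	if (n < (int)len && m->mode != DEMO_DEFAULT_MODE)
		n += snprintf(buf + n, len - n, ",mode=%04o", m->mode);

	return n < (int)len ? 0 : -1;
}
```

With every option at its default the output is empty, so remounting with
the shown string reproduces the same configuration either way.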
Miklos Szeredi	f84e3f5	2008-02-08 04:21:34 -0800	[diff] [blame]	1447
Tobin C. Harding	e04c83c	2019-05-15 10:29:08 +1000	[diff] [blame]	1448
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	1449	Resources
				1450	=========
				1451
				1452	(Note some of these resources are not up-to-date with the latest kernel
				1453	version.)
				1454
				1455	Creating Linux virtual filesystems. 2002
Alexander A. Klimov	c69f22f	2020-06-21 15:35:52 +0200	[diff] [blame]	1456	<https://lwn.net/Articles/13325/>
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	1457
				1458	The Linux Virtual File-system Layer by Neil Brown. 1999
				1459	<http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
				1460
				1461	A tour of the Linux VFS by Michael K. Johnson. 1996
Alexander A. Klimov	c69f22f	2020-06-21 15:35:52 +0200	[diff] [blame]	1462	<https://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
Pekka Enberg	cc7d1f8	2005-11-07 01:01:08 -0800	[diff] [blame]	1463
				1464	A small trail through the Linux kernel by Andries Brouwer. 2001
Alexander A. Klimov	c69f22f	2020-06-21 15:35:52 +0200	[diff] [blame]	1465	<https://www.win.tue.nl/~aeb/linux/vfs/trail.html>