=====================
Linux Filesystems API
=====================

The Linux VFS
=============

The Filesystem types
--------------------

.. kernel-doc:: include/linux/fs.h
   :internal:

The Directory Cache
-------------------

.. kernel-doc:: fs/dcache.c
   :export:

.. kernel-doc:: include/linux/dcache.h
   :internal:

Inode Handling
--------------

.. kernel-doc:: fs/inode.c
   :export:

.. kernel-doc:: fs/bad_inode.c
   :export:

Registration and Superblocks
----------------------------

.. kernel-doc:: fs/super.c
   :export:

File Locks
----------

.. kernel-doc:: fs/locks.c
   :export:

.. kernel-doc:: fs/locks.c
   :internal:

Other Functions
---------------

.. kernel-doc:: fs/mpage.c
   :export:

.. kernel-doc:: fs/namei.c
   :export:

.. kernel-doc:: fs/buffer.c
   :export:

.. kernel-doc:: block/bio.c
   :export:

.. kernel-doc:: fs/seq_file.c
   :export:

.. kernel-doc:: fs/filesystems.c
   :export:

.. kernel-doc:: fs/fs-writeback.c
   :export:

.. kernel-doc:: fs/block_dev.c
   :export:

The proc filesystem
===================

sysctl interface
----------------

.. kernel-doc:: kernel/sysctl.c
   :export:

proc filesystem interface
-------------------------

.. kernel-doc:: fs/proc/base.c
   :internal:

Events based on file descriptors
================================

.. kernel-doc:: fs/eventfd.c
   :export:

The Filesystem for Exporting Kernel Objects
===========================================

.. kernel-doc:: fs/sysfs/file.c
   :export:

.. kernel-doc:: fs/sysfs/symlink.c
   :export:

The debugfs filesystem
======================

debugfs interface
-----------------

.. kernel-doc:: fs/debugfs/inode.c
   :export:

.. kernel-doc:: fs/debugfs/file.c
   :export:

The Linux Journalling API
=========================

Overview
--------

Details
~~~~~~~

The journalling layer is easy to use. First of all you need to create a
journal_t data structure. There are two calls to do this, depending on
how you decide to allocate the physical media on which the journal
resides. The jbd2_journal_init_inode() call is for journals stored in
filesystem inodes, and the jbd2_journal_init_dev() call can be used
for a journal stored on a raw device (in a contiguous range of blocks).
Both calls return a pointer to a journal_t, so when you are finally
finished make sure you call jbd2_journal_destroy() on it to free up
any used kernel memory.

Once you have your journal_t object you need to 'mount' or load the
journal file. The journalling layer expects that the space for the journal
has already been allocated and initialized properly by the userspace tools.
When loading the journal you must call jbd2_journal_load() to process the
journal contents. If the client file system detects that the journal contents
do not need to be processed (or need not even be valid), it
may call jbd2_journal_wipe() to clear the journal contents before
calling jbd2_journal_load().

Note that jbd2_journal_wipe(..,0) calls
jbd2_journal_skip_recovery() for you if it detects any outstanding
transactions in the journal, and similarly jbd2_journal_load() will
call jbd2_journal_recover() if necessary. I would advise reading
ext4_load_journal() in fs/ext4/super.c for examples of this stage.

Now you can go ahead and start modifying the underlying filesystem.
Almost.

You still need to actually journal your filesystem changes; this is done
by wrapping them into transactions. Additionally, you need to wrap
the modification of each of the buffers with calls to the journal layer,
so that it knows which modifications you are actually making. To do
this use jbd2_journal_start(), which returns a transaction handle.

jbd2_journal_start() and its counterpart jbd2_journal_stop(), which
indicates the end of a transaction, are nestable calls, so you can
reenter a transaction if necessary, but remember that you must call
jbd2_journal_stop() the same number of times as jbd2_journal_start()
before the transaction is completed (or more accurately leaves the
update phase). Ext4/VFS makes use of this feature to simplify handling
of inode dirtying, quota support, etc.
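
The nesting rule can be pictured with a small userspace toy model. This is
only a sketch of the counting behaviour described above: every ``toy_*``
name is hypothetical and none of this is the jbd2 implementation. A nested
start returns the handle the task already holds and bumps a counter, and
only the stop that brings the counter back to zero closes the handle.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   /* Toy model only: the toy_* names are hypothetical, not the jbd2 API. */
   struct toy_handle {
       int refcount;   /* number of starts not yet matched by a stop */
       int closed;     /* set once the outermost stop has run */
   };

   struct toy_task {
       struct toy_handle *current_handle;  /* handle this task holds, if any */
   };

   /*
    * Start or re-enter a handle. The storage argument is only used when
    * the task holds no handle yet; a nested start ignores it.
    */
   static struct toy_handle *toy_start(struct toy_task *task,
                                       struct toy_handle *storage)
   {
       if (task->current_handle) {
           task->current_handle->refcount++;
           return task->current_handle;
       }
       storage->refcount = 1;
       storage->closed = 0;
       task->current_handle = storage;
       return storage;
   }

   /* Drop one reference; only the outermost stop closes the handle. */
   static void toy_stop(struct toy_task *task, struct toy_handle *handle)
   {
       assert(handle == task->current_handle);
       assert(handle->refcount > 0);
       if (--handle->refcount > 0)
           return;
       handle->closed = 1;
       task->current_handle = NULL;
   }

In this model, calling ``toy_start()`` twice and ``toy_stop()`` once leaves
the handle open, which corresponds to the requirement that stops must match
starts before the transaction can leave the update phase.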

Inside each transaction you need to wrap the modifications to the
individual buffers (blocks). Before you start to modify a buffer you
need to call jbd2_journal_get_{create,write,undo}_access() as
appropriate; this allows the journalling layer to copy the unmodified
data if it needs to. After all, the buffer may be part of a previously
uncommitted transaction. At this point you are at last ready to modify a
buffer, and once you have done so you need to call
jbd2_journal_dirty_metadata(). Or if you've asked for access to a
buffer that you now know is no longer required to be pushed back to the
device, you can call jbd2_journal_forget() in much the same way as you
might have used bforget() in the past.
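
To see why the access call has to come before the modification, consider
the following userspace toy model. It is a sketch under loud assumptions:
the ``toy_*`` names, the fixed block size and the flag fields are all
hypothetical and do not correspond to jbd2 data structures. Asking for
write access snapshots the old contents if an earlier transaction still
needs them, and reporting the buffer dirty is refused if access was never
requested.

.. code-block:: c

   #include <assert.h>
   #include <string.h>

   #define TOY_BLOCK_SIZE 8

   /* Hypothetical toy type; jbd2 keeps its own per-buffer state elsewhere. */
   struct toy_buffer {
       char data[TOY_BLOCK_SIZE];    /* live contents the caller edits */
       char frozen[TOY_BLOCK_SIZE];  /* snapshot kept for the older commit */
       int in_committing_tx;         /* an older transaction still needs it */
       int has_frozen;               /* snapshot already taken */
       int write_access;             /* caller announced intent to modify */
       int dirty;                    /* caller reported the modification */
   };

   /* Announce intent to modify; preserve the old contents if needed. */
   static void toy_get_write_access(struct toy_buffer *buf)
   {
       if (buf->in_committing_tx && !buf->has_frozen) {
           memcpy(buf->frozen, buf->data, TOY_BLOCK_SIZE);
           buf->has_frozen = 1;
       }
       buf->write_access = 1;
   }

   /* Returns 0 on success, -1 if the caller skipped the access step. */
   static int toy_dirty_metadata(struct toy_buffer *buf)
   {
       if (!buf->write_access)
           return -1;
       buf->dirty = 1;
       return 0;
   }

The point of the ordering is visible in the snapshot: if the caller wrote
to ``data`` first, the old contents that the earlier transaction expected
to commit would already be gone.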

A jbd2_journal_flush() may be called at any time to commit and
checkpoint all your transactions.

Then at umount time, in your put_super(), you can call
jbd2_journal_destroy() to clean up your in-core journal object.

Unfortunately there are a couple of ways the journal layer can cause a
deadlock. The first thing to note is that each task can only have a
single outstanding transaction at any one time; remember that nothing commits
until the outermost jbd2_journal_stop(). This means you must complete
the transaction at the end of each file/inode/address etc. operation you
perform, so that the journalling system isn't re-entered on another
journal. This matters because transactions can't be nested/batched across
differing journals, and a filesystem other than yours (say ext4) may be
modified in a later syscall.

The second case to bear in mind is that jbd2_journal_start() can block
if there isn't enough space in the journal for your transaction (based
on the passed nblocks param). When it blocks it merely(!) needs to wait
for transactions from other tasks to complete and be committed, so
essentially we are waiting for jbd2_journal_stop(). So to avoid
deadlocks you must treat jbd2_journal_start/stop() as if they were
semaphores and include them in your semaphore ordering rules.
Note that jbd2_journal_extend() has similar blocking
behaviour to jbd2_journal_start(), so you can deadlock here just as
easily as on jbd2_journal_start().
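
The semaphore analogy can be made concrete with a deliberately simplified
userspace toy model. It assumes that the journal is nothing more than a
pool of block credits that are handed back on stop; the ``toy_*`` names are
hypothetical, and a real start call would sleep where this sketch returns
``-1``.

.. code-block:: c

   #include <assert.h>

   /* Hypothetical toy type: a journal reduced to a pool of free credits. */
   struct toy_journal {
       int free_credits;
   };

   /*
    * Try to reserve nblocks credits. Returns 0 on success, or -1 at the
    * point where a real, blocking start call would go to sleep.
    */
   static int toy_credit_start(struct toy_journal *journal, int nblocks)
   {
       assert(nblocks > 0);
       if (journal->free_credits < nblocks)
           return -1;
       journal->free_credits -= nblocks;
       return 0;
   }

   /* Hand the credits back, which lets a waiting starter proceed. */
   static void toy_credit_stop(struct toy_journal *journal, int nblocks)
   {
       journal->free_credits += nblocks;
   }

If the task that would sleep in ``toy_credit_start()`` holds a lock that
the credit holder needs before it can reach ``toy_credit_stop()``, neither
side makes progress. That is the ordering problem the paragraph above
warns about.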

Try to reserve the right number of blocks the first time. ;-) This will
be the maximum number of blocks you are going to touch in this
transaction. I advise having a look at at least ext4_jbd.h to see the
basis on which ext4 makes these decisions.

Another wriggle to watch out for is your on-disk block allocation
strategy. Why? Because, if you do a delete, you need to ensure you
haven't reused any of the freed blocks until the transaction freeing
these blocks commits. If you reuse these blocks and a crash happens,
there is no way to restore the contents of the reallocated blocks as of the
end of the last fully committed transaction. One simple way of doing
this is to mark blocks as free in the internal in-memory block allocation
structures only after the transaction freeing them commits. Ext4 uses
a journal commit callback for this purpose.

With journal commit callbacks you can ask the journalling layer to call
a callback function when the transaction is finally committed to disk,
so that you can do some of your own management. You ask the journalling
layer to call the callback by simply setting the
journal->j_commit_callback function pointer, and that function is
called after each transaction commit. You can also use
transaction->t_private_list for attaching entries to a transaction
that need processing when the transaction commits.
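
The deferred-free strategy can be sketched as a userspace toy model. All
``toy_*`` names and the fixed-size arrays are hypothetical and chosen for
illustration; this is not how ext4 or jbd2 store the information. A freed
block is only recorded against the transaction, and the function playing
the role of the commit callback is what finally releases it to the
allocator.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   #define TOY_NBLOCKS 16
   #define TOY_MAX_PENDING 8

   /* Hypothetical toy types; none of these names exist in jbd2 or ext4. */
   struct toy_tx {
       int pending[TOY_MAX_PENDING];  /* blocks freed by this transaction */
       size_t npending;
   };

   struct toy_fs {
       unsigned char in_use[TOY_NBLOCKS];  /* in-memory allocation map */
   };

   /* Record the free, but keep the block marked in use until commit. */
   static int toy_free_block(struct toy_fs *fs, struct toy_tx *tx, int blk)
   {
       if (blk < 0 || blk >= TOY_NBLOCKS || !fs->in_use[blk])
           return -1;
       if (tx->npending == TOY_MAX_PENDING)
           return -1;
       tx->pending[tx->npending++] = blk;
       return 0;
   }

   /* Returns the first allocatable block, or -1 if none is available. */
   static int toy_alloc_block(struct toy_fs *fs)
   {
       for (int blk = 0; blk < TOY_NBLOCKS; blk++) {
           if (!fs->in_use[blk]) {
               fs->in_use[blk] = 1;
               return blk;
           }
       }
       return -1;
   }

   /* Plays the role of the commit callback: now the blocks may be reused. */
   static void toy_commit_callback(struct toy_fs *fs, struct toy_tx *tx)
   {
       for (size_t i = 0; i < tx->npending; i++)
           fs->in_use[tx->pending[i]] = 0;
       tx->npending = 0;
   }

Until ``toy_commit_callback()`` runs, the allocator cannot hand out a block
freed in the running transaction, so a crash before commit never finds that
block overwritten.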

JBD2 also provides a way to block all transaction updates via
jbd2_journal_{un,}lock_updates(). Ext4 uses this when it wants a
window with a clean and stable fs for a moment. E.g.

::

    jbd2_journal_lock_updates()    // stop new stuff happening..
    jbd2_journal_flush()           // checkpoint everything.
    ..do stuff on stable fs
    jbd2_journal_unlock_updates()  // carry on with filesystem use.

The opportunities for abuse and DoS attacks with this should be obvious
if you allow unprivileged userspace to trigger codepaths containing
these calls.

Summary
~~~~~~~

Using the journal is a matter of wrapping the different context changes,
namely each mount, each modification (transaction) and each changed
buffer, to tell the journalling layer about them.

Data Types
----------

The journalling layer uses typedefs to 'hide' the concrete definitions
of the structures used. As a client of the JBD2 layer you can just rely
on using the pointer as a magic cookie of some sort. Obviously the
hiding is not enforced as this is 'C'.

Structures
~~~~~~~~~~

.. kernel-doc:: include/linux/jbd2.h
   :internal:

Functions
---------

The functions here are split into two groups: those that affect a journal
as a whole, and those which are used to manage transactions.

Journal Level
~~~~~~~~~~~~~

.. kernel-doc:: fs/jbd2/journal.c
   :export:

.. kernel-doc:: fs/jbd2/recovery.c
   :internal:

Transaction Level
~~~~~~~~~~~~~~~~~

.. kernel-doc:: fs/jbd2/transaction.c
   :export:

See also
--------

`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen
Tweedie <http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz>`__

`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen
Tweedie <http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html>`__

splice API
==========

splice is a method for moving blocks of data around inside the kernel,
without continually transferring them between the kernel and user space.

.. kernel-doc:: fs/splice.c

pipes API
=========

Pipe interfaces are all for in-kernel (builtin image) use. They are not
exported for use by modules.

.. kernel-doc:: include/linux/pipe_fs_i.h
   :internal:

.. kernel-doc:: fs/pipe.c