.. SPDX-License-Identifier: GPL-2.0
.. Git blame: David Howells, commit fb28afc, 2021-02-22 13:17:24 +0000

=================================
NETWORK FILESYSTEM HELPER LIBRARY
=================================

.. Contents:

 - Overview.
 - Buffered read helpers.
   - Read helper functions.
   - Read helper structures.
   - Read helper operations.
   - Read helper procedure.
   - Read helper cache API.


Overview
========

The network filesystem helper library is a set of functions designed to aid a
network filesystem in implementing VM/VFS operations. For the moment, that
just includes turning various VM buffered read operations into requests to read
from the server. The helper library, however, can also interpose other
services, such as local caching or local data encryption.

Note that the library module doesn't link against local caching directly, so
access must be provided by the netfs.


Buffered Read Helpers
=====================

The library provides a set of read helpers that handle the ->readpage(),
->readahead() and much of the ->write_begin() VM operations and translate them
into a common call framework.

The following services are provided:

 * Handles transparent huge pages (THPs).

 * Insulates the netfs from VM interface changes.

 * Allows the netfs to arbitrarily split reads up into pieces, even ones that
   don't match page sizes or page alignments and that may cross pages.

 * Allows the netfs to expand a readahead request in both directions to meet
   its needs.

 * Allows the netfs to partially fulfil a read, which will then be resubmitted.

 * Handles local caching, allowing cached data and server-read data to be
   interleaved for a single request.

 * Handles clearing of parts of the buffer that aren't on the server.

 * Handles retrying of reads that failed, switching reads from the cache to the
   server as necessary.

 * In the future, this is a place where other services can be performed, such
   as local encryption of data to be stored remotely or in the cache.

From the network filesystem, the helpers require a table of operations. This
includes a mandatory method to issue a read operation along with a number of
optional methods.
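The split between a mandatory method and optional ones can be illustrated with
a small userspace model. This is only a sketch: ``model_request``,
``model_ops`` and ``model_run()`` are hypothetical names invented for the
illustration and are not part of the library, which uses the structures
described in the following sections.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a read request (not the kernel struct). */
struct model_request {
	long long start;	/* file position of the request */
	size_t len;		/* length of the request */
	int issued;		/* number of times issue_op() was invoked */
};

/* Operation table: issue_op is mandatory, the others may be left NULL. */
struct model_ops {
	void (*init)(struct model_request *req);	/* optional */
	void (*issue_op)(struct model_request *req);	/* mandatory */
	void (*done)(struct model_request *req);	/* optional */
};

/* Drive one request; returns false if the mandatory method is missing. */
static bool model_run(struct model_request *req, const struct model_ops *ops)
{
	if (!ops || !ops->issue_op)
		return false;
	if (ops->init)			/* optional hooks are skipped if NULL */
		ops->init(req);
	ops->issue_op(req);
	if (ops->done)
		ops->done(req);
	return true;
}

/* Example mandatory method: just counts how often it was called. */
static void model_issue(struct model_request *req)
{
	req->issued++;
}
```

A filesystem that only needs the mandatory method would fill in ``issue_op``
and leave the remaining pointers NULL.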


Read Helper Functions
---------------------

Three read helpers are provided::

	void netfs_readahead(struct readahead_control *ractl,
			     const struct netfs_read_request_ops *ops,
			     void *netfs_priv);
	int netfs_readpage(struct file *file,
			   struct page *page,
			   const struct netfs_read_request_ops *ops,
			   void *netfs_priv);
	int netfs_write_begin(struct file *file,
			      struct address_space *mapping,
			      loff_t pos,
			      unsigned int len,
			      unsigned int flags,
			      struct page **_page,
			      void **_fsdata,
			      const struct netfs_read_request_ops *ops,
			      void *netfs_priv);

Each corresponds to a VM operation, with the addition of a couple of parameters
for the use of the read helpers:

 * ``ops``

   A table of operations through which the helpers can talk to the filesystem.

 * ``netfs_priv``

   Filesystem private data (can be NULL).

Both of these values will be stored into the read request structure.

For ->readahead() and ->readpage(), the network filesystem should just jump
into the corresponding read helper; whereas for ->write_begin(), it may be a
little more complicated as the network filesystem might want to flush
conflicting writes or track dirty data and needs to put the acquired page if an
error occurs after calling the helper.

The helpers manage the read request, calling back into the network filesystem
through the supplied table of operations. Waits will be performed as
necessary before returning for helpers that are meant to be synchronous.

If an error occurs and netfs_priv is non-NULL, ops->cleanup() will be called to
deal with it. If some parts of the request are in progress when an error
occurs, the request will get partially completed if sufficient data is read.

Additionally, there is::

	void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
				     ssize_t transferred_or_error,
				     bool was_async);

which should be called to complete a read subrequest. This is given the number
of bytes transferred or a negative error code, plus a flag indicating whether
the operation was asynchronous (ie. whether the follow-on processing can be
done in the current context, given this may involve sleeping).
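The "bytes transferred or negative error code" convention can be modelled in
userspace as follows. This is a sketch under stated assumptions:
``model_subreq``, ``model_outcome`` and ``model_subreq_terminated()`` are
hypothetical names, the ``was_async`` flag is not modelled, and capping an
over-reported count is a defensive choice of the sketch rather than documented
library behaviour.

```c
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/* Hypothetical classification of a completed I/O on one slice. */
enum model_outcome { MODEL_COMPLETE, MODEL_SHORT, MODEL_FAILED };

/* Hypothetical stand-in for a read subrequest. */
struct model_subreq {
	size_t len;		/* bytes wanted for this slice */
	size_t transferred;	/* bytes obtained so far */
	int error;		/* 0, or the negative code that was reported */
};

/*
 * Record the result of one I/O.  A negative value is an error code; a
 * non-negative value is the number of bytes this I/O added to the slice.
 */
static enum model_outcome model_subreq_terminated(struct model_subreq *subreq,
						  ssize_t transferred_or_error)
{
	if (transferred_or_error < 0) {
		subreq->error = (int)transferred_or_error;
		return MODEL_FAILED;
	}

	subreq->transferred += (size_t)transferred_or_error;
	if (subreq->transferred > subreq->len)	/* sketch-only safety cap */
		subreq->transferred = subreq->len;

	return subreq->transferred < subreq->len ? MODEL_SHORT : MODEL_COMPLETE;
}
```

A short outcome is what leads to the slice being reissued, as described in the
procedure section.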


Read Helper Structures
----------------------

The read helpers make use of a couple of structures to maintain the state of
the read. The first is a structure that manages a read request as a whole::

	struct netfs_read_request {
		struct inode *inode;
		struct address_space *mapping;
		struct netfs_cache_resources cache_resources;
		void *netfs_priv;
		loff_t start;
		size_t len;
		loff_t i_size;
		const struct netfs_read_request_ops *netfs_ops;
		unsigned int debug_id;
		...
	};

The above fields are the ones the netfs can use. They are:

 * ``inode``
 * ``mapping``

   The inode and the address space of the file being read from. The mapping
   may or may not point to inode->i_data.

 * ``cache_resources``

   Resources for the local cache to use, if present.

 * ``netfs_priv``

   The network filesystem's private data. The value for this can be passed in
   to the helper functions or set during the request. The ->cleanup() op will
   be called if this is non-NULL at the end.

 * ``start``
 * ``len``

   The file position of the start of the read request and the length. These
   may be altered by the ->expand_readahead() op.

 * ``i_size``

   The size of the file at the start of the request.

 * ``netfs_ops``

   A pointer to the operation table. The value for this is passed into the
   helper functions.

 * ``debug_id``

   A number allocated to this operation that can be displayed in trace lines
   for reference.


The second structure is used to manage individual slices of the overall read
request::

	struct netfs_read_subrequest {
		struct netfs_read_request *rreq;
		loff_t start;
		size_t len;
		size_t transferred;
		unsigned long flags;
		unsigned short debug_index;
		...
	};

Each subrequest is expected to access a single source, though the helpers will
handle falling back from one source type to another. The members are:

 * ``rreq``

   A pointer to the read request.

 * ``start``
 * ``len``

   The file position of the start of this slice of the read request and the
   length.

 * ``transferred``

   The amount of data transferred so far of the length of this slice. The
   network filesystem or cache should start the operation this far into the
   slice. If a short read occurs, the helpers will call again, having updated
   this to reflect the amount read so far.

 * ``flags``

   Flags pertaining to the read. There are two of interest to the filesystem
   or cache:

   * ``NETFS_SREQ_CLEAR_TAIL``

     This can be set to indicate that the remainder of the slice, from
     transferred to len, should be cleared.

   * ``NETFS_SREQ_SEEK_DATA_READ``

     This is a hint to the cache that it might want to try skipping ahead to
     the next data (ie. using SEEK_DATA).

 * ``debug_index``

   A number allocated to this slice that can be displayed in trace lines for
   reference.
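How ``start``, ``len`` and ``transferred`` combine can be sketched in
userspace. The names below (``model_slice``, ``MODEL_SREQ_CLEAR_TAIL`` and the
three functions) are hypothetical stand-ins, and a flat byte buffer takes the
place of the pages the helpers actually operate on.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_SREQ_CLEAR_TAIL 0x1UL	/* stand-in for NETFS_SREQ_CLEAR_TAIL */

/* Hypothetical stand-in for the subrequest fields discussed above. */
struct model_slice {
	long long start;	/* file position of the slice */
	size_t len;		/* length of the slice */
	size_t transferred;	/* bytes already obtained */
	unsigned long flags;
};

/* File position at which the next I/O on this slice should begin. */
static long long model_resume_pos(const struct model_slice *s)
{
	return s->start + (long long)s->transferred;
}

/* Number of bytes still outstanding for this slice. */
static size_t model_remaining(const struct model_slice *s)
{
	return s->len - s->transferred;
}

/*
 * If the clear-tail flag is set, zero the unread part of the slice, from
 * transferred to len, and mark the slice as fully dealt with.  buf[0]
 * corresponds to s->start.  Returns the number of bytes zeroed.
 */
static size_t model_clear_tail(struct model_slice *s, unsigned char *buf)
{
	size_t n;

	if (!(s->flags & MODEL_SREQ_CLEAR_TAIL))
		return 0;
	n = model_remaining(s);
	memset(buf + s->transferred, 0, n);
	s->transferred = s->len;
	return n;
}
```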


Read Helper Operations
----------------------

The network filesystem must provide the read helpers with a table of operations
through which it can issue requests and negotiate::

	struct netfs_read_request_ops {
		void (*init_rreq)(struct netfs_read_request *rreq, struct file *file);
		bool (*is_cache_enabled)(struct inode *inode);
		int (*begin_cache_operation)(struct netfs_read_request *rreq);
		void (*expand_readahead)(struct netfs_read_request *rreq);
		bool (*clamp_length)(struct netfs_read_subrequest *subreq);
		void (*issue_op)(struct netfs_read_subrequest *subreq);
		bool (*is_still_valid)(struct netfs_read_request *rreq);
		int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
					 struct page *page, void **_fsdata);
		void (*done)(struct netfs_read_request *rreq);
		void (*cleanup)(struct address_space *mapping, void *netfs_priv);
	};

The operations are as follows:

 * ``init_rreq()``

   [Optional] This is called to initialise the request structure. It is given
   the file for reference and can modify the ->netfs_priv value.

 * ``is_cache_enabled()``

   [Required] This is called by netfs_write_begin() to ask if the file is being
   cached. It should return true if it is being cached and false otherwise.

 * ``begin_cache_operation()``

   [Optional] This is called to ask the network filesystem to call into the
   cache (if present) to initialise the caching state for this read. The netfs
   library module cannot access the cache directly, so the network filesystem
   should call something like fscache_begin_read_operation() to do this.

   The cache gets to store its state in ->cache_resources and must set a table
   of operations of its own there (though of a different type).

   This should return 0 on success and an error code otherwise. If an error is
   reported, the operation may proceed anyway, just without local caching (only
   out of memory and interruption errors cause failure here).

 * ``expand_readahead()``

   [Optional] This is called to allow the filesystem to expand the size of a
   readahead read request. The filesystem gets to expand the request in both
   directions, though it's not permitted to reduce it as the numbers may
   represent an allocation already made. If local caching is enabled, it gets
   to expand the request first.

   Expansion is communicated by changing ->start and ->len in the request
   structure. Note that if any change is made, ->len must be increased by at
   least as much as ->start is reduced.

 * ``clamp_length()``

   [Optional] This is called to allow the filesystem to reduce the size of a
   subrequest. The filesystem can use this, for example, to chop up a request
   that has to be split across multiple servers or to put multiple reads in
   flight.

   In line with the bool return type in the table above, this should return
   true on success and false on error.

 * ``issue_op()``

   [Required] The helpers use this to dispatch a subrequest to the server for
   reading. In the subrequest, ->start, ->len and ->transferred indicate what
   data should be read from the server.

   There is no return value; the netfs_subreq_terminated() function should be
   called to indicate whether or not the operation succeeded and how much data
   it transferred. The filesystem also should not deal with setting pages
   uptodate, unlocking them or dropping their refs - the helpers need to deal
   with this as they have to coordinate with copying to the local cache.

   Note that the helpers have the pages locked, but not pinned. It is possible
   to use the ITER_XARRAY iov iterator to refer to the range of the inode that
   is being operated upon without the need to allocate large bvec tables.

 * ``is_still_valid()``

   [Optional] This is called to find out if the data just read from the local
   cache is still valid. It should return true if it is still valid and false
   if not. If it's not still valid, it will be reread from the server.

 * ``check_write_begin()``

   [Optional] This is called from the netfs_write_begin() helper once it has
   allocated/grabbed the page to be modified to allow the filesystem to flush
   conflicting state before allowing it to be modified.

   It should return 0 if everything is now fine, -EAGAIN if the page should be
   regrabbed and any other error code to abort the operation.

 * ``done()``

   [Optional] This is called after the pages in the request have all been
   unlocked (and marked uptodate if applicable).

 * ``cleanup()``

   [Optional] This is called as the request is being deallocated so that the
   filesystem can clean up ->netfs_priv.
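The arithmetic behind ``expand_readahead()`` and ``clamp_length()`` can be
sketched in userspace. Everything here is an assumption made for illustration:
the ``model_*`` names are hypothetical, the granule and ``rsize`` values are
passed as plain parameters (a real filesystem would take them from its own
state), and file positions are assumed to be non-negative.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins carrying only the ->start and ->len fields. */
struct model_rreq { long long start; size_t len; };
struct model_sreq { long long start; size_t len; };

/*
 * Expand a request outwards so both ends fall on multiples of granule
 * (granule > 0).  The start only moves down and the end only moves up, so
 * len grows by at least as much as start is reduced.
 */
static void model_expand_readahead(struct model_rreq *rreq, size_t granule)
{
	long long g = (long long)granule;
	long long end = rreq->start + (long long)rreq->len;
	long long new_start = rreq->start - rreq->start % g;
	long long new_end = (end + g - 1) / g * g;

	rreq->start = new_start;
	rreq->len = (size_t)(new_end - new_start);
}

/* Check the rule: the expanded range must cover the original range. */
static bool model_expansion_ok(const struct model_rreq *before,
			       const struct model_rreq *after)
{
	return after->start <= before->start &&
	       after->start + (long long)after->len >=
	       before->start + (long long)before->len;
}

/* Cap a subrequest at rsize bytes; false means the slice is rejected. */
static bool model_clamp_length(struct model_sreq *subreq, size_t rsize)
{
	if (rsize == 0)
		return false;
	if (subreq->len > rsize)
		subreq->len = rsize;
	return true;
}
```

For instance, a request at position 5000 of length 3000 expanded to a 4096-byte
granule becomes a request at position 4096 of length 4096.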



Read Helper Procedure
---------------------

The read helpers work by the following general procedure:

 * Set up the request.

 * For readahead, allow the local cache and then the network filesystem to
   propose expansions to the read request. This is then proposed to the VM.
   If the VM cannot fully perform the expansion, a partially expanded read will
   be performed, though this may not get written to the cache in its entirety.

 * Loop around slicing chunks off of the request to form subrequests:

   * If a local cache is present, it gets to do the slicing, otherwise the
     helpers just try to generate maximal slices.

   * The network filesystem gets to clamp the size of each slice if it is to be
     the source. This allows rsize and chunking to be implemented.

   * The helpers issue a read from the cache or a read from the server or just
     clear the slice as appropriate.

   * The next slice begins at the end of the last one.

   * As slices finish being read, they terminate.

 * When all the subrequests have terminated, the subrequests are assessed and
   any that are short or have failed are reissued:

   * Failed cache requests are issued against the server instead.

   * Failed server requests just fail.

   * Short reads against either source will be reissued against that source
     provided they have transferred some more data:

     * The cache may need to skip holes that it can't do DIO from.

     * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the
       end of the slice instead of reissuing.

 * Once the data is read, the pages that have been fully read/cleared:

   * Will be marked uptodate.

   * If a cache is present, will be marked with PG_fscache.

   * Will be unlocked.

 * Any pages that need writing to the cache will then have DIO writes issued.

 * Synchronous operations will wait for reading to be complete.

 * Writes to the cache will proceed asynchronously and the pages will have the
   PG_fscache mark removed when that completes.

 * The request structures will be cleaned up when everything has completed.
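The slicing loop in the procedure above can be sketched as a userspace model.
The assumptions are deliberately simple and are not how the library decides
things: data below a hypothetical ``cached_end`` position is treated as present
in the cache, server slices are capped at a hypothetical ``rsize``, and all
``model_*`` names are invented for the illustration.

```c
#include <stddef.h>

/* Hypothetical source of the data for one slice. */
enum model_source { MODEL_FROM_CACHE, MODEL_FROM_SERVER };

/* Hypothetical stand-in for a subrequest produced by the slicing loop. */
struct model_slice {
	long long start;
	size_t len;
	enum model_source source;
};

/*
 * Cut the range [start, start + len) into slices written to out[].
 * Returns the number of slices, or 0 if rsize is 0 or out[] is too small.
 */
static size_t model_slice_request(long long start, size_t len,
				  long long cached_end, size_t rsize,
				  struct model_slice *out, size_t max_out)
{
	size_t n = 0;

	if (rsize == 0)
		return 0;

	while (len > 0) {
		struct model_slice *s;
		size_t part = len;	/* start with a maximal slice */

		if (n == max_out)
			return 0;
		s = &out[n++];
		s->start = start;

		if (start < cached_end) {
			/* The cache does the slicing: stop where its data ends. */
			s->source = MODEL_FROM_CACHE;
			if ((long long)part > cached_end - start)
				part = (size_t)(cached_end - start);
		} else {
			/* The filesystem clamps slices it will serve itself. */
			s->source = MODEL_FROM_SERVER;
			if (part > rsize)
				part = rsize;
		}

		s->len = part;
		start += (long long)part;  /* next slice begins where this ends */
		len -= part;
	}
	return n;
}
```

With 4096 bytes cached and an ``rsize`` of 4096, a 10000-byte request at
position 0 yields one cache slice followed by two server slices.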


Read Helper Cache API
---------------------

When implementing a local cache to be used by the read helpers, two things are
required: some way for the network filesystem to initialise the caching for a
read request and a table of operations for the helpers to call.

The network filesystem's ->begin_cache_operation() method is called to set up a
cache and this must call into the cache to do the work. If using fscache, for
example, the network filesystem would call::

	int fscache_begin_read_operation(struct netfs_read_request *rreq,
					 struct fscache_cookie *cookie);

passing in the request pointer and the cookie corresponding to the file.

The netfs_read_request object contains a place for the cache to hang its
state::

	struct netfs_cache_resources {
		const struct netfs_cache_ops *ops;
		void *cache_priv;
		void *cache_priv2;
	};

This contains an operations table pointer and two private pointers. The
operation table looks like the following::

	struct netfs_cache_ops {
		void (*end_operation)(struct netfs_cache_resources *cres);

		void (*expand_readahead)(struct netfs_cache_resources *cres,
					 loff_t *_start, size_t *_len, loff_t i_size);

		enum netfs_read_source (*prepare_read)(struct netfs_read_subrequest *subreq,
						       loff_t i_size);

		int (*read)(struct netfs_cache_resources *cres,
			    loff_t start_pos,
			    struct iov_iter *iter,
			    bool seek_data,
			    netfs_io_terminated_t term_func,
			    void *term_func_priv);

		int (*write)(struct netfs_cache_resources *cres,
			     loff_t start_pos,
			     struct iov_iter *iter,
			     netfs_io_terminated_t term_func,
			     void *term_func_priv);
	};

With a termination handler function pointer::

	typedef void (*netfs_io_terminated_t)(void *priv,
					      ssize_t transferred_or_error,
					      bool was_async);

The methods defined in the table are:

 * ``end_operation()``

   [Required] Called to clean up the resources at the end of the read request.

 * ``expand_readahead()``

   [Optional] Called at the beginning of a netfs_readahead() operation to allow
   the cache to expand a request in either direction. This allows the cache to
   size the request appropriately for the cache granularity.

   The function is passed pointers to the start and length in its parameters,
   plus the size of the file for reference, and adjusts the start and length
   appropriately.

 * ``prepare_read()``

   [Required] Called to configure the next slice of a request. ->start and
   ->len in the subrequest indicate where and how big the next slice can be;
   the cache gets to reduce the length to match its granularity requirements.

   It should return one of:

   * ``NETFS_FILL_WITH_ZEROES``
   * ``NETFS_DOWNLOAD_FROM_SERVER``
   * ``NETFS_READ_FROM_CACHE``
   * ``NETFS_INVALID_READ``

   to indicate whether the slice should just be cleared or whether it should be
   downloaded from the server or read from the cache - or whether slicing
   should be given up at the current point.

 * ``read()``

   [Required] Called to read from the cache. The start file offset is given
   along with an iterator to read to, which gives the length also. It can be
   given a hint requesting that it seek forward from that start position for
   data.

   Also provided is a pointer to a termination handler function and private
   data to pass to that function. The termination function should be called
   with the number of bytes transferred or an error code, plus a flag
   indicating whether the termination is definitely happening in the caller's
   context.

 * ``write()``

   [Required] Called to write to the cache. The start file offset is given
   along with an iterator to write from, which gives the length also.

   Also provided is a pointer to a termination handler function and private
   data to pass to that function. The termination function should be called
   with the number of bytes transferred or an error code, plus a flag
   indicating whether the termination is definitely happening in the caller's
   context.

Note that these methods are passed a pointer to the cache resource structure,
not the read request structure as they could be used in other situations where
there isn't a read request structure as well, such as writing dirty data to the
cache.
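The relationship between the cache resources, the operation table and the
termination handler can be sketched with a toy in-memory "cache" in userspace.
This is an illustration under assumptions, not the fscache implementation: all
``model_*`` names are hypothetical, a plain buffer and length replace the
``iov_iter``, only ``read()`` is modelled, the read completes synchronously,
and -1 is a placeholder error code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>	/* ssize_t */

/* Same shape as the termination handler typedef shown above. */
typedef void (*model_io_terminated_t)(void *priv, ssize_t transferred_or_error,
				      bool was_async);

struct model_cache_resources;

/* Reduced operation table: only read() is modelled in this sketch. */
struct model_cache_ops {
	int (*read)(struct model_cache_resources *cres, long long start_pos,
		    void *buf, size_t len,
		    model_io_terminated_t term_func, void *term_func_priv);
};

struct model_cache_resources {
	const struct model_cache_ops *ops;
	void *cache_priv;	/* here: the cached bytes */
	void *cache_priv2;	/* here: a size_t holding how many there are */
};

/* Copy out what the toy cache holds, then report through term_func. */
static int model_mem_read(struct model_cache_resources *cres,
			  long long start_pos, void *buf, size_t len,
			  model_io_terminated_t term_func, void *term_func_priv)
{
	const unsigned char *data = cres->cache_priv;
	size_t size = *(const size_t *)cres->cache_priv2;
	ssize_t ret;

	if (start_pos < 0 || (size_t)start_pos > size) {
		ret = -1;	/* placeholder error code */
	} else {
		size_t avail = size - (size_t)start_pos;

		if (len > avail)
			len = avail;	/* short read at end of cached data */
		memcpy(buf, data + start_pos, len);
		ret = (ssize_t)len;
	}

	if (term_func)
		term_func(term_func_priv, ret, false);	/* completed inline */
	return ret < 0 ? (int)ret : 0;
}

static const struct model_cache_ops model_mem_cache_ops = {
	.read = model_mem_read,
};

/* Example termination handler: stores the result in an ssize_t. */
static void model_record_result(void *priv, ssize_t transferred_or_error,
				bool was_async)
{
	(void)was_async;
	*(ssize_t *)priv = transferred_or_error;
}
```

Because the method receives only the resources pointer, the same toy read could
be driven without any read request structure, which mirrors the note above.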