Redfs ubuntu hwe 6.17.0 16.16 24.04.1 #102
Open
hbirth wants to merge 47 commits into DDNStorage:redfs-ubuntu-hwe-6.17.0-16.16-24.04.1-picked from
Conversation
This is especially needed for better ftrace analysis, for example to build histograms. So far the request's unique ID was missing, because it was only assigned after the first trace message. IDs/req-unique might not come up perfectly sequentially anymore, but especially with cloned devices or io-uring this did not work perfectly anyway. Signed-off-by: Bernd Schubert <bschubert@ddn.com> (imported from commit 4415892)
fuse_uring_send_next_to_ring() can just call into fuse_uring_send() and avoid code duplication. Signed-off-by: Bernd Schubert <bschubert@ddn.com> (imported from commit 9efaa8d)
Rename trace_fuse_request_send to trace_fuse_request_enqueue
Add trace_fuse_request_send
Add trace_fuse_request_bg_enqueue
Add trace_fuse_request_enqueue

This helps to track entire request time and time in different queues. Signed-off-by: Bernd Schubert <bschubert@ddn.com> (imported from commit 4a7f142)
If pinned pages are used the application can write into these pages and io_uring_cmd_complete_in_task() is not needed. Signed-off-by: Bernd Schubert <bschubert@ddn.com> (imported from commit 5f0264c)
readahead is currently limited to bdi->ra_pages. One can change that after the mount with something like

    minor=$(stat -c "%d" /path/to/fuse)
    echo 1024 > /sys/class/bdi/0:${minor}/read_ahead_kb

The issue is that the fuse server cannot do that from its ->init method, as it would have to know the device minor, and obtaining it (the stat above) blocks until init is complete. Fuse already sets the bdi value, but the upper limit is the current bdi value. For CAP_SYS_ADMIN we can allow higher values. Signed-off-by: Bernd Schubert <bschubert@ddn.com> (imported from commit 763c96d)
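As a rough illustration of the intended CAP_SYS_ADMIN exception (the function name and placement are hypothetical, not the actual patch):

    /* Hypothetical sketch: values above the current bdi limit are
     * only allowed for CAP_SYS_ADMIN. */
    #include <linux/backing-dev.h>
    #include <linux/capability.h>

    static int fuse_set_ra_pages(struct backing_dev_info *bdi,
                                 unsigned long ra_pages)
    {
            if (ra_pages > bdi->ra_pages && !capable(CAP_SYS_ADMIN))
                    return -EPERM;

            bdi->ra_pages = ra_pages;
            return 0;
    }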
Due to user buffer misalignment we actually need one page more, i.e. 1025 instead of 1024; this will be handled differently. For now we just bump up the max. (imported from commit 3f71501)
When writeback cache is enabled it is beneficial for data consistency to communicate to the FUSE server when the kernel prepares a page for caching. This lets the FUSE server react and lock the page. Additionally the kernel lets the FUSE server decide how much data it locks by the same call and keeps the given information in the dlm lock management. If the feature is not supported it will be disabled after the first unsuccessful use.

- Add DLM_LOCK fuse opcode
- Add cache page lock caching for writeback cache functionality. This means sending out a FUSE call whenever the kernel prepares a page for writeback cache. The kernel will manage the cache so that it keeps track of already acquired locks (except for the case that is documented in the code).
- Use rb-trees for the management of the already 'locked' page ranges
- Use rw_semaphore for synchronization in fuse_dlm_cache

(imported from commit 287c884)
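A minimal sketch of such a locked-range lookup, assuming an rb-tree keyed by range start (structure and function names are illustrative, not necessarily the patch's):

    #include <linux/rbtree.h>
    #include <linux/types.h>

    /* One cached range the server has already granted a lock for. */
    struct fuse_dlm_range {
            struct rb_node node;
            u64 start;
            u64 end;
    };

    /* Return true if [start, end] is fully covered by one cached range;
     * the caller holds the fuse_dlm_cache rw_semaphore for reading. */
    static bool fuse_dlm_range_cached(struct rb_root *root, u64 start, u64 end)
    {
            struct rb_node *n = root->rb_node;

            while (n) {
                    struct fuse_dlm_range *r =
                            rb_entry(n, struct fuse_dlm_range, node);

                    if (end < r->start)
                            n = n->rb_left;
                    else if (start > r->end)
                            n = n->rb_right;
                    else    /* overlap: covered only if fully inside */
                            return start >= r->start && end <= r->end;
            }
            return false;
    }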
Add support to invalidate inode aliases when doing inode invalidation. This is useful for distributed file systems, which use DLM for cache coherency. So, when a client loses its inode lock, it should invalidate its inode cache and dentry cache, since another client may delete this file after getting the inode lock. Signed-off-by: Yong Ze Chen <yochen@ddn.com> (imported from commit 49720b5)
Renumber the operation code to a high value to avoid conflicts with upstream. (imported from commit 27a0e9e)
Send a DLM_WB_LOCK request in the page_mkwrite handler to enable FUSE filesystems to acquire a distributed lock manager (DLM) lock for protecting upcoming dirty pages when a previously read-only mapped page is about to be written. Signed-off-by: Cheng Ding <cding@ddn.com> (imported from commit ec36c45)
Allow read_folio to return an EAGAIN error and translate it to AOP_TRUNCATED_PAGE to retry page fault and read operations. This is used to prevent a deadlock from folio lock/DLM lock order reversal:

- Fault or read operations acquire the folio lock first, then the DLM lock.
- The FUSE daemon blocks new DLM lock acquisition while it is invalidating the page cache; invalidate_inode_pages2_range() acquires the folio lock.

To prevent the deadlock, the FUSE daemon will fail its DLM lock acquisition with EAGAIN if it detects an in-flight page cache invalidation. Signed-off-by: Cheng Ding <cding@ddn.com> (imported from commit 8ecf118)
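In sketch form, the kernel-side translation could look like this (the helper name is hypothetical; the actual patch wires this into the fault and read paths):

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    /* Hypothetical helper: turn the server's EAGAIN into a retry. */
    static int fuse_dlm_read_folio_err(struct folio *folio, int err)
    {
            if (err == -EAGAIN) {
                    /* The server refused the DLM lock because a page cache
                     * invalidation is in flight: drop the folio lock and
                     * let the caller retry the fault/read from scratch. */
                    folio_unlock(folio);
                    return AOP_TRUNCATED_PAGE;
            }
            return err;
    }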
generic/488 fails with fuse2fs in the following fashion: generic/488 _check_generic_filesystem: filesystem on /dev/sdf is inconsistent (see /var/tmp/fstests/generic/488.full for details) This test opens a large number of files, unlinks them (which really just renames them to fuse hidden files), closes the program, unmounts the filesystem, and runs fsck to check that there aren't any inconsistencies in the filesystem. Unfortunately, the 488.full file shows that there are a lot of hidden files left over in the filesystem, with incorrect link counts. Tracing fuse_request_* shows that there are a large number of FUSE_RELEASE commands that are queued up on behalf of the unlinked files at the time that fuse_conn_destroy calls fuse_abort_conn. Had the connection not aborted, the fuse server would have responded to the RELEASE commands by removing the hidden files; instead they stick around. Create a function to push all the background requests to the queue and then wait for the number of pending events to hit zero, and call this before fuse_abort_conn. That way, all the pending events are processed by the fuse server and we don't end up with a corrupt filesystem. Signed-off-by: Darrick J. Wong <djwong@kernel.org> (imported from commit d4262f9)
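Sketched under the assumption that the helper reuses fuse_conn's existing background accounting (the function name here is made up; the real patch may differ):

    /* Hypothetical sketch: push all queued background requests out and
     * wait until none are pending, so the server can answer the queued
     * FUSE_RELEASE requests before fuse_abort_conn() runs. */
    static void fuse_flush_bg_and_wait(struct fuse_conn *fc)
    {
            spin_lock(&fc->bg_lock);
            fc->max_background = UINT_MAX;  /* stop throttling the bg queue */
            flush_bg_queue(fc);
            spin_unlock(&fc->bg_lock);

            wait_event(fc->blocked_waitq, !fc->num_background);
    }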
This is a preparation to allow fuse-io-uring bg queue flush from flush_bg_queue(). This does two function renames:

    fuse_uring_flush_bg           -> fuse_uring_flush_queue_bg
    fuse_uring_abort_end_requests -> fuse_uring_flush_bg

And fuse_uring_abort_end_queue_requests() is moved to fuse_uring_stop_queues(). Signed-off-by: Bernd Schubert <bschubert@ddn.com> (imported from commit e70ef24)
This is useful to have a unique API to flush background requests, for example when the bg queue gets flushed before the remainder of fuse_conn_destroy() runs. Signed-off-by: Bernd Schubert <bschubert@ddn.com> (imported from commit fc4120c)
When the fuse server responds to a dlm request with an error other than ENOSYS, the lock size will most likely be set to zero. In that case the kernel will abort the fuse connection, which is completely unnecessary. Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com> (imported from commit 0bc2f9c)
Check whether dlm is still enabled when interpreting the error returned from the fuse server. Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com> (imported from commit f6fbf7c)
Sometimes the file offset alignment needs to be opt-in to achieve optimum performance at the backend store. For example when ErasureCode [1] is used at the backend store, the optimum write performance is achieved when the WRITE request is aligned with the stripe size of the ErasureCode. Otherwise a non-aligned WRITE request needs to be split at the stripe size boundary. Handling these split partial requests is quite costly: first the whole stripe to which the split partial request belongs needs to be read out, then the read stripe buffer is overwritten with the request, and finally the whole stripe is written back to persistent storage. Thus the backend store can suffer severe performance degradation when WRITE requests do not fit into one stripe exactly. The write performance can be 10x slower when the request is 256KB in size given a 4MB stripe size, and in theory there can be a 50% performance degradation whenever the request is not stripe boundary aligned.

Besides, testing indicates that the non-alignment issue becomes more severe when decreasing fuse's max_ratio, maybe partly because background writeback is then more likely to run in parallel with the dirtier.

    fuse's max_ratio    ratio of aligned WRITE requests
    ----------------    -------------------------------
    70                  99.9%
    40                  74%
    20                  45%
    10                  20%

With the patched version, which makes the alignment constraint opt-in when constructing WRITE requests, the ratio of aligned WRITE requests increases to 98% (previously 20%) when fuse's max_ratio is 10.

fuse: fix alignment to work with redfs ubuntu

- small fix to make the fuse alignment patch work with redfs ubuntu 6.8.x
- add writeback_control to fuse_writepage_need_send() to make more accurate decisions about when to skip sending data
- fix shift number for FUSE_ALIGN_PG_ORDER
- remove test code

[1] https://lore.kernel.org/linux-fsdevel/20240124070512.52207-1-jefflexu@linux.alibaba.com/T/#m9bce469998ea6e4f911555c6f7be1e077ce3d8b4

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com> (imported from commit 5e590a6)
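To make the alignment constraint concrete, a toy check for the 4MB stripe size used in the example above (the constant and the helper are illustrative only, not part of the patch):

    /* Toy illustration: a write that is not stripe aligned forces a
     * read-modify-write of each partially covered stripe. */
    #define EC_STRIPE_SIZE (4UL << 20)      /* 4MB stripe, example value */

    static bool write_is_stripe_aligned(loff_t pos, size_t count)
    {
            return !(pos & (EC_STRIPE_SIZE - 1)) &&
                   !(count & (EC_STRIPE_SIZE - 1));
    }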
- Increase the possible lock size to 64 bit.
- Change the semantics of DLM locks to request start and end.
- Change the semantics of the DLM request return to mark start and end of the locked area.
- Better prepare the dlm lock range cache rb-tree for unaligned byte range locks, which could return any value as long as it is larger than the range requested.
- Add the case where start and end are zero to destroy the cache.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com> (imported from commit 87968c7)
Fix reference count leak of payload pages during fuse argument copies. Signed-off-by: Cheng Ding <cding@ddn.com> (imported from commit 8b75cf0)
This is another preparation and will be used for the decision of which queue to add a request to. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> (imported from commit e4698fa)
This is preparation for follow-up commits that allow running with a reduced number of queues. Signed-off-by: Bernd Schubert <bschubert@ddn.com> (imported from commit 2e27c33)
Add per-CPU and per-NUMA node bitmasks to track which io-uring queues are registered. Signed-off-by: Bernd Schubert <bschubert@ddn.com> (imported from commit be6edce)
Queue selection (fuse_uring_get_queue) can handle a reduced number of queues - using io-uring is now possible even with a single queue and entry. The FUSE_URING_REDUCED_Q flag is introduced to tell the fuse server that reduced queues are possible, i.e. if the flag is set, the fuse server is free to reduce the number of queues. Signed-off-by: Bernd Schubert <bschubert@ddn.com> (imported from commit f620f3d)
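Conceptually the selection can fall back like this (a sketch only; the bitmap name and the exact fallback policy are assumptions, not the patch):

    /* Sketch: prefer the current CPU's queue; if no queue is registered
     * for this CPU, fall back to the first registered queue.
     * 'registered' is a hypothetical bitmap of registered qids. */
    static struct fuse_ring_queue *
    fuse_uring_get_queue(struct fuse_ring *ring)
    {
            unsigned int qid = raw_smp_processor_id();

            if (!test_bit(qid, ring->registered))
                    qid = find_first_bit(ring->registered, ring->nr_queues);

            return ring->queues[qid];
    }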
Running background IO on a different core makes quite a difference.

    fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
        --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
        --runtime=30s --group_reporting --ioengine=io_uring \
        --direct=1

    unpatched READ: bw=272MiB/s (285MB/s) ...
    patched   READ: bw=650MiB/s (682MB/s)

The reason is easily visible: the fio process migrates between CPUs when requests are submitted on the queue for the same core.

With --iodepth=8

    unpatched READ: bw=466MiB/s (489MB/s)
    patched   READ: bw=641MiB/s (672MB/s)

Without io-uring (--iodepth=8)

    READ: bw=729MiB/s (764MB/s)

Without fuse (--iodepth=8)

    READ: bw=2199MiB/s (2306MB/s)

(Tests were done with

    <libfuse>/example/passthrough_hp -o allow_other --nopassthrough \
        [-o io_uring] /tmp/source /tmp/dest )

Additional notes:

With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)

    READ: bw=903MiB/s (946MB/s)

With just a random qid (--iodepth=8)

    READ: bw=429MiB/s (450MB/s)

With --iodepth=1

    unpatched READ: bw=195MiB/s (204MB/s)
    patched   READ: bw=232MiB/s (243MB/s)

With --iodepth=1 --numjobs=2

    unpatched READ: bw=366MiB/s (384MB/s)
    patched   READ: bw=472MiB/s (495MB/s)

With --iodepth=1 --numjobs=8

    unpatched READ: bw=1437MiB/s (1507MB/s)
    patched   READ: bw=1529MiB/s (1603MB/s)

    fuse without io-uring READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
    no-fuse               READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...

In summary, for async requests the core doing application IO is busy sending requests, and processing IOs should be done on a different core. Spreading the load on random cores is also not desirable, as the core might be frequency scaled down and/or in C1 sleep states. Not shown here, but differences are much smaller when the system uses the performance governor instead of schedutil (the ubuntu default) - obviously at the cost of higher system power consumption, which is not desirable either. Results without io-uring (which uses fixed libfuse threads per queue) heavily depend on the current number of active threads. Libfuse defaults to a maximum of 10 threads, but the actual max number of threads is a parameter. Also, results for fuse without io-uring heavily depend on whether another workload was already running, as libfuse starts these threads dynamically - i.e. the more threads are active, the worse the performance.

Signed-off-by: Bernd Schubert <bschubert@ddn.com> (imported from commit c6399ea)
This is to further improve performance.

    fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
        --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
        --runtime=30s --group_reporting --ioengine=io_uring \
        --direct=1

    unpatched READ: bw=650MiB/s (682MB/s)
    patched   READ: bw=995MiB/s (1043MB/s)

With --iodepth=8

    unpatched READ: bw=641MiB/s (672MB/s)
    patched   READ: bw=966MiB/s (1012MB/s)

The reason is that with --iodepth=x (x > 1) fio submits multiple async requests and a single queue might become CPU limited, i.e. spreading the load helps. (imported from commit 2e73b0b)
generic_file_direct_write() also does this and has a large comment about it. The reproducer here is xfstest's generic/209, which exactly exercises competing DIO writes and cached IO reads. Signed-off-by: Bernd Schubert <bschubert@ddn.com> (imported from commit 9e04c8a)
This was done as a condition on direct_io_allow_mmap, but I believe this is not right, as a file might be open two times - once with write-back enabled and another time with FOPEN_DIRECT_IO. Signed-off-by: Bernd Schubert <bschubert@ddn.com> (imported from commit dfca338)
With the reduced queue feature, io-uring is marked as ready after receiving the 1st ring entry. At this time other queues might still be in the process of registration, and then a race happens:

    fuse_uring_queue_fuse_req -> no queue entry registered yet
    list_add_tail             -> fuse request gets queued

So far fetching requests from the list only happened from FUSE_IO_URING_CMD_COMMIT_AND_FETCH, but without new requests on the same queue it would actually never send requests from that queue - the request was stuck.
(imported from commit 3bfb6cd)
fuse.h: add new opcode FUSE_COMPOUND
fuse_compound.c: add new functionality to pack multiple fuse operations into one compound command
file.c: add an implementation of open+getattr

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com> (imported from commit d9e7351)
Simplify fuse_compound_req to hold only the pointers to the added fuse args and the request housekeeping. Simplify the open+getattr call by using helper functions to fill out the fuse request parameters. Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com> (imported from commit 1607a03) (imported from commit 9df5e4c) (imported from commit 9921bcd) (imported from commit 09d6f59)
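Roughly, the simplified request then looks like this (field names and the ops cap are guesses from the description, not the actual code):

    /* Rough shape of the simplified compound request (illustrative). */
    #define FUSE_COMPOUND_MAX_OPS 8              /* hypothetical cap */

    struct fuse_compound_req {
            struct fuse_args *args[FUSE_COMPOUND_MAX_OPS]; /* per-op args */
            unsigned int num_ops;                /* ops packed so far */
            int error;                           /* request housekeeping */
    };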
There was a race between fuse_uring_cancel() and
fuse_uring_register()/fuse_uring_next_fuse_req(),
which comes from the queue reduction feature.
Race was
    core-A                                 core-B

    fuse_uring_register
        spin_lock(&queue->lock);
        fuse_uring_ent_avail()
        spin_unlock(&queue->lock);
                                           fuse_uring_cancel()
                                               spin_lock(&queue->lock);
                                               ent->state = FRRS_USERSPACE;
                                               list_move()
    fuse_uring_next_fuse_req()
        spin_lock(&queue->lock);
        fuse_uring_ent_avail(ent, queue);
        fuse_uring_send_next_to_ring()
        spin_unlock(&queue->lock);
    fuse_uring_send_next_to_ring
I.e. fuse_uring_ent_avail() was called two times, and the 2nd time the entry had actually already been handled by fuse_uring_cancel().
Solution is to not call fuse_uring_ent_avail() from
fuse_uring_register. With that the entry is not in state
FRRS_AVAILABLE and fuse_uring_cancel() will not touch it.
fuse_uring_send_next_to_ring() will mark it as FRRS_AVAILABLE,
and then either assign a request to it and change state again
or will not touch it at all anymore - race fixed.
This will be folded into the upstream queue reduction patches
and therefore has the RED-34640 commit message.
Also entirely removed is fuse_uring_do_register(), as the remaining work can be done by the caller.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(imported from commit 932feba)
This is just to avoid code duplication with an upcoming commit. Signed-off-by: Bernd Schubert <bschubert@ddn.com> (imported from commit ec3217f)
This issue could sometimes be observed during libfuse xfstests, with dmesg printing something like "kernel: WARNING: CPU: 4 PID: 0 at fs/fuse/dev_uring.c:204 fuse_uring_destruct+0x1f5/0x200 [fuse]". The cause: if umount happens or the fuse daemon quits right after the daemon submitted FUSE_IO_URING_CMD_REGISTER SQEs, then after all uring queues are stopped, one or more unprocessed FUSE_IO_URING_CMD_REGISTER SQEs might still get processed. New ring entities are then created and added to ent_avail_queue, and immediately moved by fuse_uring_cancel to ent_in_userspace after the SQEs get canceled. These ring entities were never moved to ent_released and stayed in ent_in_userspace when fuse_uring_destruct was called. One way to solve this would be to also free 'ent_in_userspace' in fuse_uring_destruct(), but from a code point of view it is hard to see why that is needed. As suggested by Joanne, another solution is to avoid moving entries in fuse_uring_cancel() to the 'ent_in_userspace' list and just release them directly. Fixes: b6236c8 ("fuse: {io-uring} Prevent mount point hang on fuse-server termination") Cc: Joanne Koong <joannelkoong@gmail.com> Cc: <stable@vger.kernel.org> # v6.14 Signed-off-by: Jian Huang Li <ali@ddn.com> Signed-off-by: Bernd Schubert <bschubert@ddn.com> (imported from commit 30d0473)
When a request is terminated before it has been committed, the request is not removed from the queue's list. This leaves a dangling list entry that leads to list corruption and use-after-free issues. Remove the request from the queue's list for terminated non-committed requests. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Fixes: c090c8a ("fuse: Add io-uring sqe commit and fetch support") Cc: stable@vger.kernel.org Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> (imported from commit 07214a0)
This fixes a memory leak. (imported from commit f75b62f)
FUSE_INIT has always been asynchronous with mount, meaning that the server processed this request after the mount syscall returned. This means that FUSE_INIT can't supply the root inode's ID, hence it currently has a hardcoded value. There are other limitations, such as not being able to perform getxattr during mount, which is needed by selinux. To remove these limitations, allow the server to process FUSE_INIT while initializing the in-core super block for the fuse filesystem. This can only be done if the server is prepared to handle this, so add the FUSE_DEV_IOC_SYNC_INIT ioctl, which a) lets the server know whether this feature is supported, returning ENOTTY otherwise, and b) lets the kernel know to perform a synchronous initialization. The implementation is slightly tricky, since fuse_dev/fuse_conn are set up only during super block creation. This is solved by setting the private data of the fuse device file to a special value ((struct fuse_dev *) 1) and waiting for this to be turned into a proper fuse_dev before commencing with operations on the device file. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> (imported from commit dfb84c3)
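From the server side, the negotiation described above might be probed like this (a sketch assuming the uapi header exports FUSE_DEV_IOC_SYNC_INIT and that the ioctl takes no argument; the helper is illustrative):

    #include <errno.h>
    #include <sys/ioctl.h>
    #include <linux/fuse.h>   /* assumed to define FUSE_DEV_IOC_SYNC_INIT */

    /* Returns 1 if the kernel supports synchronous FUSE_INIT, 0 if not,
     * -1 on unexpected errors. */
    static int probe_sync_init(int devfd)
    {
            if (ioctl(devfd, FUSE_DEV_IOC_SYNC_INIT) == 0)
                    return 1;
            if (errno == ENOTTY)
                    return 0;       /* kernel without this feature */
            return -1;
    }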
no functional changes Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com> (imported from commit f0bccb2)
For now compounds are a module option and disabled by default. Signed-off-by: Bernd Schubert <bschubert@ddn.com> (imported from commit f3b301d)
The use of bitmap_weight() didn't give the actual index, but always returned the current cpu, which resulted in a totally wrong mapping. Now a counter is increased for every mapping, ignoring cores not in the given (numa) map, and the index is found from that counter. Also added is a pr_debug(), which can be activated for example with

    echo "module redfs +p" > /proc/dynamic_debug/control

(Pity that upstream is not open for such debug messages). (imported from commit bcbb684)
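In sketch form, the fixed mapping counts CPUs in the node mask up to the current one (an illustrative helper, not the literal patch):

    #include <linux/cpumask.h>

    /* Count registered CPUs in the node mask until 'cpu' is reached;
     * the count is the queue index for that CPU. */
    static int fuse_uring_cpu_to_qid(const struct cpumask *node_mask, int cpu)
    {
            int c, idx = 0;

            for_each_cpu(c, node_mask) {
                    if (c == cpu)
                            return idx;
                    idx++;
            }
            return -1;      /* cpu not present in this node's mask */
    }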
Fix the include sequence, which caused a compile error on aarch64. (imported from commit f5fed0e)
The mapping might point to a totally different core due to random assignment. For performance, using the current core might be beneficial.
Example (with core binding)
unpatched WRITE: bw=841MiB/s
patched WRITE: bw=1363MiB/s
With
fio --name=test --ioengine=psync --direct=1 \
--rw=write --bs=1M --iodepth=1 --numjobs=1 \
--filename_format=/redfs/testfile.\$jobnum --size=100G \
--thread --create_on_open=1 --runtime=30s --cpus_allowed=1
In order to get good numbers, --cpus_allowed=1 is needed.
This could be improved by a future change that avoids
cpu migration in fuse_request_end() on wake_up() call.
(imported from commit 32e0073)
This fixes xfstests generic/451 (for both O_DIRECT and FOPEN_DIRECT_IO direct write). Commit b359af8 ("fuse: Invalidate the page cache after FOPEN_DIRECT_IO write") tries to fix the similar issue for FOPEN_DIRECT_IO write, which can be reproduced by xfstests generic/209. It only fixes the issue for synchronous direct write, while omitting the case of asynchronous direct write (exactly what generic/451 targets). For O_DIRECT direct write it is somewhat more complicated. For synchronous direct write, generic_file_direct_write() will invalidate the page cache after the write, and thus it can pass generic/209. For asynchronous direct write, the invalidation in generic_file_direct_write() is bypassed since the invalidation shall be done when the asynchronous IO completes. This is omitted in FUSE, whereby generic/451 fails. Fix this by conveying the invalidation for both synchronous and asynchronous write:

- with FOPEN_DIRECT_IO
  - sync write: invalidate in fuse_send_write()
  - async write: invalidate in fuse_aio_complete() with FUSE_ASYNC_DIO, fuse_send_write() otherwise
- without FOPEN_DIRECT_IO
  - sync write: invalidate in generic_file_direct_write()
  - async write: invalidate in fuse_aio_complete() with FUSE_ASYNC_DIO, generic_file_direct_write() otherwise

Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> (imported from commit f6de786)
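The invalidation itself is the usual page cache range invalidation; in sketch form (the helper name is hypothetical, hooked from the places listed above):

    #include <linux/pagemap.h>

    /* Hypothetical helper: drop cached pages covering a completed
     * direct write so subsequent buffered reads see the new data. */
    static void fuse_dio_invalidate_pages(struct inode *inode,
                                          loff_t pos, size_t count)
    {
            invalidate_inode_pages2_range(inode->i_mapping,
                                          pos >> PAGE_SHIFT,
                                          (pos + count - 1) >> PAGE_SHIFT);
    }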
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>