
Redfs ubuntu hwe 6.17.0 16.16 24.04.1 #102

Open
hbirth wants to merge 47 commits into DDNStorage:redfs-ubuntu-hwe-6.17.0-16.16-24.04.1-picked from hbirth:redfs-ubuntu-hwe-6.17.0-16.16_24.04.1

Conversation


@hbirth hbirth commented Feb 21, 2026

No description provided.

This is especially needed for better ftrace analysis,
for example to build histograms. So far the request unique ID
was missing from the first trace message, because it was only
assigned afterwards.

IDs/req-unique might not come up perfectly sequentially
anymore, but especially with cloned devices or io-uring this
did not work perfectly anyway.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit 4415892)
@hbirth hbirth requested a review from bsbernd February 21, 2026 08:16
bsbernd and others added 28 commits February 21, 2026 19:25
fuse_uring_send_next_to_ring() can just call into fuse_uring_send()
and avoid code duplication.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit 9efaa8d)
Rename trace_fuse_request_send to trace_fuse_request_enqueue
Add trace_fuse_request_send
Add trace_fuse_request_bg_enqueue
Add trace_fuse_request_enqueue

This helps to track entire request time and time in different
queues.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit 4a7f142)
This is to allow copying into the buffer from the application
without the need to copy in ring context (and with that, without
requiring the ring task to be active in kernel space).

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit 43d1a63)

(imported from commit ea01f94)
If pinned pages are used the application can write into these
pages and io_uring_cmd_complete_in_task() is not needed.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit 5f0264c)
readahead is currently limited to bdi->ra_pages. One can change
that after the mount with something like

minor=$(stat -c "%d" /path/to/fuse)
echo 1024 > /sys/class/bdi/0:$minor/read_ahead_kb

The issue is that the fuse server cannot do that from its ->init
method, as it has to know the device minor, which blocks before
init is complete.

FUSE already sets the bdi value, but the upper limit is the current
bdi value. For CAP_SYS_ADMIN we can allow higher values.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit 763c96d)
Due to user buffer misalignment we actually need one page more,
i.e. 1025 instead of 1024; this will be handled differently later.
For now we just bump up the max.

(imported from commit 3f71501)
When the writeback cache is enabled, it is beneficial for data
consistency to tell the FUSE server when the kernel prepares a page for
caching. This lets the FUSE server react and lock the page.

Additionally, the kernel lets the FUSE server decide how much data to
lock in the same call, and keeps the given information in the DLM lock
management.

If the feature is not supported, it will be disabled after the first
unsuccessful use.

- Add DLM_LOCK fuse opcode
- Add cache page lock caching for writeback cache functionality.
This means sending out a FUSE call whenever the kernel prepares a page
for writeback cache. The kernel will manage the cache so that it will keep
track of already acquired locks.
(except for the case that is documented in the code)
- Use rb-trees for the management of the already 'locked' page ranges
- Use rw_semaphore for synchronization in fuse_dlm_cache

(imported from commit 287c884)
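The locked-range cache described above (an rb-tree under an rw_semaphore in the actual patch) can be illustrated with a userspace sketch; the names and the array-backed storage here are purely hypothetical stand-ins, not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: the kernel uses an rb-tree protected by an
 * rw_semaphore; a fixed array stands in to show the lookup semantics. */
struct lock_range { unsigned long start, end; }; /* inclusive page indexes */

struct dlm_cache {
    struct lock_range ranges[16];
    size_t nr;
};

/* True if @idx lies inside an already-acquired lock range, i.e. no new
 * DLM_LOCK request needs to be sent for this page. */
static bool dlm_cache_covers(const struct dlm_cache *c, unsigned long idx)
{
    for (size_t i = 0; i < c->nr; i++)
        if (idx >= c->ranges[i].start && idx <= c->ranges[i].end)
            return true;
    return false;
}

/* Record a range the server granted (the server may grant more than
 * was asked for). */
static void dlm_cache_add(struct dlm_cache *c, unsigned long start,
                          unsigned long end)
{
    if (c->nr < 16)
        c->ranges[c->nr++] = (struct lock_range){ start, end };
}
```

With this shape, a page prepared for writeback only triggers a FUSE call when `dlm_cache_covers()` misses; the real rb-tree gives the same semantics with logarithmic lookups.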
Add support to invalidate inode aliases when doing inode invalidation.
This is useful for distributed file systems, which use DLM for cache
coherency. So, when a client loses its inode lock, it should invalidate
its inode cache and dentry cache, since another client may delete
this file after acquiring the inode lock.

Signed-off-by: Yong Ze Chen <yochen@ddn.com>

(imported from commit 49720b5)
Renumber the operation code to a high value to avoid conflicts with upstream.

(imported from commit 27a0e9e)
Send a DLM_WB_LOCK request in the page_mkwrite handler to enable FUSE
filesystems to acquire a distributed lock manager (DLM) lock for
protecting upcoming dirty pages when a previously read-only mapped
page is about to be written.

Signed-off-by: Cheng Ding <cding@ddn.com>

(imported from commit ec36c45)
Allow read_folio to return EAGAIN error and translate it to
AOP_TRUNCATE_PAGE to retry page fault and read operations.
This is used to prevent a deadlock from folio lock/DLM lock order reversal:
 - Fault or read operations acquire the folio lock first, then the DLM lock.
 - The FUSE daemon blocks new DLM lock acquisition while it is invalidating
   the page cache; invalidate_inode_pages2_range() acquires the folio lock.
To prevent the deadlock, the FUSE daemon will fail its DLM lock acquisition
with EAGAIN if it detects an in-flight page cache invalidation.

Signed-off-by: Cheng Ding <cding@ddn.com>

(imported from commit 8ecf118)
generic/488 fails with fuse2fs in the following fashion:

generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
(see /var/tmp/fstests/generic/488.full for details)

This test opens a large number of files, unlinks them (which really just
renames them to fuse hidden files), closes the program, unmounts the
filesystem, and runs fsck to check that there aren't any inconsistencies
in the filesystem.

Unfortunately, the 488.full file shows that there are a lot of hidden
files left over in the filesystem, with incorrect link counts.  Tracing
fuse_request_* shows that there are a large number of FUSE_RELEASE
commands that are queued up on behalf of the unlinked files at the time
that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
aborted, the fuse server would have responded to the RELEASE commands by
removing the hidden files; instead they stick around.

Create a function to push all the background requests to the queue and
then wait for the number of pending events to hit zero, and call this
before fuse_abort_conn.  That way, all the pending events are processed
by the fuse server and we don't end up with a corrupt filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

(imported from commit d4262f9)
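The flush-then-wait shutdown ordering described above might be sketched like this; all names are illustrative, and in the kernel the busy loop would be a wait_event() on the pending counter rather than synchronous draining:

```c
#include <assert.h>

/* Hypothetical sketch of the shutdown ordering; not the kernel symbols. */
struct conn { int bg_queued; int pending; };

/* Push every background request to the device queue. */
static void flush_bg_queue(struct conn *fc)
{
    fc->pending += fc->bg_queued;
    fc->bg_queued = 0;
}

/* The server answered one outstanding request. */
static void drain_one(struct conn *fc)
{
    if (fc->pending > 0)
        fc->pending--;
}

/* Before aborting the connection: flush, then wait for pending == 0 so
 * the server can still process e.g. the queued FUSE_RELEASE requests
 * that remove hidden files. */
static void conn_flush_and_wait(struct conn *fc)
{
    flush_bg_queue(fc);
    while (fc->pending > 0)
        drain_one(fc); /* in the kernel: wait_event() on the counter */
}
```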
This is a preparation to allow fuse-io-uring bg queue
flush from flush_bg_queue()

This does two function renames:
fuse_uring_flush_bg -> fuse_uring_flush_queue_bg
fuse_uring_abort_end_requests -> fuse_uring_flush_bg

And fuse_uring_abort_end_queue_requests() is moved to
fuse_uring_stop_queues().

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit e70ef24)
This is useful to have a unique API to flush background requests,
for example when the bg queue gets flushed before the remainder of
fuse_conn_destroy().

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit fc4120c)
When the fuse server responds to a DLM request with an error other
than ENOSYS, the lock size will most likely be set to zero. In that
case the kernel would abort the fuse connection, which is completely
unnecessary.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

(imported from commit 0bc2f9c)
Check whether DLM is still enabled when interpreting the error
returned by the fuse server.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

(imported from commit f6fbf7c)
Sometimes the file offset alignment needs to be opt-in to achieve the
optimum performance at the backend store.

For example when ErasureCode [1] is used at the backend store, the
optimum write performance is achieved when the WRITE request is aligned
with the stripe size of ErasureCode.  Otherwise a non-aligned WRITE
request needs to be split at the stripe size boundary.  Handling these
split partial requests is quite costly: first the whole stripe to which
the split partial request belongs needs to be read out, then the read
stripe buffer is overwritten with the request, and finally the whole
stripe is written back to the persistent storage.
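As a rough illustration of the alignment constraint, a write can be chunked so that each request ends on a stripe boundary whenever it crosses one, avoiding the read-modify-write of partial stripes; this helper is a hypothetical sketch, not the patch itself:

```c
#include <assert.h>

/* Illustrative only: given a write at byte @pos of @len bytes and a
 * backend stripe of @stripe bytes, return how many bytes to put into
 * the current request so that it ends on a stripe boundary whenever
 * the write crosses one. */
static unsigned long aligned_chunk(unsigned long pos, unsigned long len,
                                   unsigned long stripe)
{
    unsigned long to_boundary = stripe - (pos % stripe);

    /* A short write that stays inside the stripe is sent as-is. */
    return len < to_boundary ? len : to_boundary;
}
```

Repeatedly taking `aligned_chunk()` bytes off the front of a large write yields at most one unaligned head request, with every following request stripe-aligned.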

Thus the backend store can suffer severe performance degradation when
WRITE requests can not fit into one stripe exactly.  The write performance
can be 10x slower when the request is 256KB in size given 4MB stripe size.
Also there can be 50% performance degradation in theory if the request
is not stripe boundary aligned.

Besides, our testing indicates that the non-alignment issue becomes
more severe when decreasing fuse's max_ratio, maybe partly because
background writeback is then more likely to run in parallel with the
dirtier.

fuse's max_ratio	ratio of aligned WRITE requests
----------------	-------------------------------
70			99.9%
40			74%
20			45%
10			20%

With the patched version, which makes the alignment constraint opt-in
when constructing WRITE requests, the ratio of aligned WRITE requests
increases to 98% (previously 20%) when fuse's max_ratio is 10.

fuse: fix alignment to work with redfs ubuntu

- small fix to make the fuse alignment patch work with redfs ubuntu 6.8.x
- add writeback_control to fuse_writepage_need_send() to make
more accurate decisions about when to skip sending data
- fix shift number for FUSE_ALIGN_PG_ORDER
- remove test code

[1] https://lore.kernel.org/linux-fsdevel/20240124070512.52207-1-jefflexu@linux.alibaba.com/T/#m9bce469998ea6e4f911555c6f7be1e077ce3d8b4
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

(imported from commit 5e590a6)
- Increase the possible lock size to 64 bit.
- change semantics of DLM locks to request start and end
- change semantics of DLM request return to mark start
and end of the locked area
- better prepare dlm lock range cache rb-tree
for unaligned byte range locks which could return
any value as long as it is larger than the range
requested
- add the case where start and end are zero
to destroy the cache

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

(imported from commit 87968c7)
Fix reference count leak of payload pages during fuse argument copies.

Signed-off-by: Cheng Ding <cding@ddn.com>

(imported from commit 8b75cf0)
This is another preparation and will be used for decision
which queue to add a request to.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>

(imported from commit e4698fa)
This is preparation for follow up commits that allow to run with a
reduced number of queues.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit 2e27c33)
Add per-CPU and per-NUMA node bitmasks to track which
io-uring queues are registered.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit be6edce)
Queue selection (fuse_uring_get_queue) can now handle a reduced number
of queues - using io-uring is possible even with a single queue and
entry.

The FUSE_URING_REDUCED_Q flag is introduced to tell the fuse server
that reduced queues are possible, i.e. if the flag is set, the fuse
server is free to reduce the number of queues.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit f620f3d)
Running background IO on a different core makes quite a difference.

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
--bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based\
--runtime=30s --group_reporting --ioengine=io_uring\
 --direct=1

unpatched
   READ: bw=272MiB/s (285MB/s) ...
patched
   READ: bw=650MiB/s (682MB/s)

The reason is easily visible: the fio process migrates between CPUs
when requests are submitted on the queue for the same core.

With --iodepth=8

unpatched
   READ: bw=466MiB/s (489MB/s)
patched
   READ: bw=641MiB/s (672MB/s)

Without io-uring (--iodepth=8)
   READ: bw=729MiB/s (764MB/s)

Without fuse (--iodepth=8)
   READ: bw=2199MiB/s (2306MB/s)

(Test were done with
<libfuse>/example/passthrough_hp -o allow_other --nopassthrough  \
[-o io_uring] /tmp/source /tmp/dest
)

Additional notes:

With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
   READ: bw=903MiB/s (946MB/s)

With just a random qid (--iodepth=8)
   READ: bw=429MiB/s (450MB/s)

With --iodepth=1
unpatched
   READ: bw=195MiB/s (204MB/s)
patched
   READ: bw=232MiB/s (243MB/s)

With --iodepth=1 --numjobs=2
unpatched
   READ: bw=366MiB/s (384MB/s)
patched
   READ: bw=472MiB/s (495MB/s)

With --iodepth=1 --numjobs=8
unpatched
   READ: bw=1437MiB/s (1507MB/s)
patched
   READ: bw=1529MiB/s (1603MB/s)
fuse without io-uring
   READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
no-fuse
   READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...

In summary, for async requests the core doing application IO is busy
sending requests, and processing IOs should be done on a different core.
Spreading the load on random cores is also not desirable, as the core
might be frequency scaled down and/or in C1 sleep states. Not shown here,
but differences are much smaller when the system uses the performance
governor instead of schedutil (the Ubuntu default) - obviously at the
cost of higher system power consumption, which is not desirable either.

Results without io-uring (which uses fixed libfuse threads per queue)
heavily depend on the current number of active threads. Libfuse uses a
default of max 10 threads, but the actual max number of threads is a
parameter. Also, the fuse-without-io-uring results heavily depend on
whether another workload was already running, as libfuse starts these
threads dynamically - i.e. the more threads are active, the worse the
performance.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit c6399ea)
This is to further improve performance.

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
--bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based\
--runtime=30s --group_reporting --ioengine=io_uring\
--direct=1

unpatched
   READ: bw=650MiB/s (682MB/s)
patched:
   READ: bw=995MiB/s (1043MB/s)

with --iodepth=8

unpatched
   READ: bw=641MiB/s (672MB/s)
patched
   READ: bw=966MiB/s (1012MB/s)

The reason is that with --iodepth=x (x > 1) fio submits multiple async
requests and a single queue might become CPU limited, i.e. spreading
the load helps.

(imported from commit 2e73b0b)
generic_file_direct_write() also does this and has a large
comment about it.

Reproducer here is xfstest's generic/209, which is exactly to
have competing DIO write and cached IO read.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit 9e04c8a)
This was conditioned on direct_io_allow_mmap, but I believe this
is not right, as a file might be open two times - once with
write-back enabled and another time with FOPEN_DIRECT_IO.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit dfca338)
With the reduced queue feature, io-uring is marked as ready after
receiving the 1st ring entry. At this time other queues might still
be in the process of registration, and then a race happens:

fuse_uring_queue_fuse_req -> no queue entry registered yet
    list_add_tail -> fuse request gets queued

So far fetching requests from the list only happened from
FUSE_IO_URING_CMD_COMMIT_AND_FETCH, but without new requests
on the same queue, it would actually never send requests
from that queue - the request was stuck.

(imported from commit 3bfb6cd)
hbirth and others added 17 commits February 21, 2026 19:26
fuse.h: add new opcode FUSE_COMPOUND

fuse_compound.c: add new functionality to pack multiple
fuse operations into one compound command

file.c: add an implementation of open+getattr

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

(imported from commit d9e7351)
Simplify fuse_compound_req to hold only the pointers
to the added fuse args and the request housekeeping.

Simplify open+getattr call by using helper functions
to fill out the fuse request parameters

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

(imported from commit 1607a03)
(imported from commit 9df5e4c)
(imported from commit 9921bcd)
(imported from commit 09d6f59)
There was a race between fuse_uring_cancel() and
fuse_uring_register()/fuse_uring_next_fuse_req(),
which comes from the queue reduction feature.

Race was

core-A                         core-B
fuse_uring_register
    spin_lock(&queue->lock);
    fuse_uring_ent_avail()
    spin_unlock(&queue->lock);

                                fuse_uring_cancel()
                                    spin_lock(&queue->lock);
                                    ent->state = FRRS_USERSPACE;
                                    list_move()

    fuse_uring_next_fuse_req()
        spin_lock(&queue->lock);
        fuse_uring_ent_avail(ent, queue);
        fuse_uring_send_next_to_ring()
        spin_unlock(&queue->lock);
        fuse_uring_send_next_to_ring

I.e. fuse_uring_ent_avail() was called two times and the 2nd time
when the entry was actually already handled by fuse_uring_cancel().

Solution is to not call fuse_uring_ent_avail() from
fuse_uring_register. With that the entry is not in state
FRRS_AVAILABLE and fuse_uring_cancel() will not touch it.
fuse_uring_send_next_to_ring() will mark it as FRRS_AVAILABLE,
and then either assign a request to it and change state again
or will not touch it at all anymore - race fixed.

This will be folded into the upstream queue reduction patches
and therefore has the RED-34640 commit message.

Also entirely removed is fuse_uring_do_register() as remaining
work can be done by the caller.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit 932feba)
This is just to avoid code duplication with an upcoming commit.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit ec3217f)
This issue could sometimes be observed during libfuse xfstests, with
dmesg printing something like "kernel: WARNING: CPU: 4 PID: 0 at
fs/fuse/dev_uring.c:204 fuse_uring_destruct+0x1f5/0x200 [fuse]".

The cause: if umount happens or the fuse daemon quits right after it
submitted FUSE_IO_URING_CMD_REGISTER SQEs, then after all uring queues
have stopped, one or more unprocessed FUSE_IO_URING_CMD_REGISTER SQEs
might still get processed; new ring entities are created and added to
ent_avail_queue, and fuse_uring_cancel immediately moves them to
ent_in_userspace after the SQEs get canceled. These ring entities were
never moved to ent_released and stayed in ent_in_userspace when
fuse_uring_destruct was called.

One way to solve it would be to also free 'ent_in_userspace' in
fuse_uring_destruct(), but from code point of view it is hard to see why
it is needed. As suggested by Joanne, another solution is to avoid moving
entries in fuse_uring_cancel() to the 'ent_in_userspace' list and just
releasing them directly.

Fixes: b6236c8 ("fuse: {io-uring} Prevent mount point hang on fuse-server termination")
Cc: Joanne Koong <joannelkoong@gmail.com>
Cc: <stable@vger.kernel.org> # v6.14
Signed-off-by: Jian Huang Li <ali@ddn.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit 30d0473)
When a request is terminated before it has been committed, the request
is not removed from the queue's list. This leaves a dangling list entry
that leads to list corruption and use-after-free issues.

Remove the request from the queue's list for terminated non-committed
requests.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Fixes: c090c8a ("fuse: Add io-uring sqe commit and fetch support")
Cc: stable@vger.kernel.org
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

(imported from commit 07214a0)
This fixes a memory leak.

(imported from commit f75b62f)
FUSE_INIT has always been asynchronous with mount.  That means that the
server processed this request after the mount syscall returned.

This means that FUSE_INIT can't supply the root inode's ID, hence it
currently has a hardcoded value.  There are other limitations such as not
being able to perform getxattr during mount, which is needed by selinux.

To remove these limitations allow server to process FUSE_INIT while
initializing the in-core super block for the fuse filesystem.  This can
only be done if the server is prepared to handle this, so add
FUSE_DEV_IOC_SYNC_INIT ioctl, which

 a) lets the server know whether this feature is supported, returning
 ENOTTY otherwise.

 b) lets the kernel know to perform a synchronous initialization

The implementation is slightly tricky, since fuse_dev/fuse_conn are set up
only during super block creation.  This is solved by setting the private
data of the fuse device file to a special value ((struct fuse_dev *) 1) and
waiting for this to be turned into a proper fuse_dev before commencing
with operations on the device file.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

(imported from commit dfb84c3)
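The sentinel-pointer trick above can be sketched in userspace; the marker value follows the commit text, while the helper name is hypothetical:

```c
#include <assert.h>

/* Sketch of the sentinel trick (illustrative): before the super block
 * exists, the device file's private data holds a marker value instead
 * of a real fuse_dev pointer; device operations must wait until it has
 * been replaced by a real pointer. */
#define FUSE_DEV_SYNC_INIT_MARKER ((void *)1)

/* True once private data points at a real fuse_dev, i.e. operations on
 * the device file may proceed. */
static int dev_ready(void *private_data)
{
    return private_data != (void *)0 &&
           private_data != FUSE_DEV_SYNC_INIT_MARKER;
}
```

In the kernel, a reader seeing the marker would sleep (e.g. on a waitqueue) until super block creation installs the real fuse_dev.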
no functional changes

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

(imported from commit f0bccb2)
Take actions on the PR merged event of this repo. Run
copy-from-linux-branch.sh and create a PR for redfs.

(cherry picked from commit f54872e)

(imported from commit 522fddf)
Switch to pull_request_target instead of pull_request as the github
security requirement. Also limits the scope to protected PR.

(cherry picked from commit b9980ad)

(imported from commit e504e4a)
Remove the pull_request_target as it doesn't work.

(cherry picked from commit 5328f66)

(imported from commit 5277386)
For now compounds are a module option and disabled by default

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

(imported from commit f3b301d)
The use of bitmap_weight() didn't give the actual index, but always
returned the current cpu, which resulted in a totally wrong mapping.

It now just increments a counter for every core, ignoring cores not in
the given (numa) mask, and uses that to find the index.

Also added is a pr_debug(), which can be activated for example
with
echo "module redfs +p" >/proc/dynamic_debug/control
(Pity that upstream is not open for such debug messages).

(imported from commit bcbb684)
Fix the include sequence which causes a compiling error on aarch64.

(imported from commit f5fed0e)
The mapping might point to a totally different core due to random
assignment. For performance, using the current core might be
beneficial.

Example (with core binding)

unpatched WRITE: bw=841MiB/s
patched   WRITE: bw=1363MiB/s

With
fio --name=test --ioengine=psync --direct=1 \
    --rw=write --bs=1M --iodepth=1 --numjobs=1 \
    --filename_format=/redfs/testfile.\$jobnum --size=100G \
    --thread --create_on_open=1 --runtime=30s --cpus_allowed=1

In order to get the good number, `--cpus_allowed=1` is needed.
This could be improved by a future change that avoids
cpu migration in fuse_request_end() on wake_up() call.

(imported from commit 32e0073)
This fixes xfstests generic/451 (for both O_DIRECT and FOPEN_DIRECT_IO
direct write).

Commit b359af8 ("fuse: Invalidate the page cache after
FOPEN_DIRECT_IO write") tries to fix the similar issue for
FOPEN_DIRECT_IO write, which can be reproduced by xfstests generic/209.
It only fixes the issue for synchronous direct write, while omitting
the case for asynchronous direct write (exactly targeted by
generic/451).

While for O_DIRECT direct write, it's somewhat more complicated.  For
synchronous direct write, generic_file_direct_write() will invalidate
the page cache after the write, and thus it can pass generic/209.  While
for asynchronous direct write, the invalidation in
generic_file_direct_write() is bypassed since the invalidation shall be
done when the asynchronous IO completes.  This is omitted in FUSE,
which is why generic/451 fails.

Fix this by conveying the invalidation for both synchronous and
asynchronous write.

- with FOPEN_DIRECT_IO
  - sync write,  invalidate in fuse_send_write()
  - async write, invalidate in fuse_aio_complete() with FUSE_ASYNC_DIO,
		 fuse_send_write() otherwise
- without FOPEN_DIRECT_IO
  - sync write,  invalidate in generic_file_direct_write()
  - async write, invalidate in fuse_aio_complete() with FUSE_ASYNC_DIO,
		 generic_file_direct_write() otherwise

Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>

(imported from commit f6de786)
@hbirth hbirth force-pushed the redfs-ubuntu-hwe-6.17.0-16.16_24.04.1 branch from 8d8f77b to 43ca84c on February 21, 2026
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

8 participants