Conversation

@Maegereg (Contributor) commented Nov 10, 2025

Covers addition, multiplication, and equality (contains, for arbs) for acb, arb, fmpq, and fmpz.

The primary goal is to use these to measure the performance effect of using the stable API (#338), but they could be useful for other things in the future.

I'm particularly looking for feedback on whether this should include additional types or operations.

@oscarbenjamin (Collaborator)

Is there some package that could be used for benchmarking here?

Ideally what you want is to be able to compare two different versions to see possible statistically significant differences.

@oscarbenjamin (Collaborator)

The failed CI job is possibly due to the Cython constraint and might be fixed after gh-350.

@Maegereg (Contributor, Author)

Is there some package that could be used for benchmarking here?

I was initially assuming that we'd want to follow the philosophy of the tests and keep things pretty minimal. But I've done a bit of research now, and it looks like pyperf could be useful here: it has good support for running a suite of benchmarks and for comparing multiple runs (which would let us compare different builds of the library). We'd still need either some manual effort to set up the different builds in different environments, or some scripting on top of pyperf to automate that a little (I was planning to do that anyway in the world where we aren't using pyperf).

If that sounds reasonable to you, I can re-write these benchmarks to use pyperf. I plan to leave the scaffolding for handling multiple builds to a future PR, so that right now we can focus on whether these are the right things to measure.
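For context, the core of what pyperf automates can be sketched with only the standard library: collect repeated timing samples and report a mean and standard deviation, so that two builds can later be compared for a statistically significant difference. This is an illustrative stdlib sketch, not the PR's actual benchmark code, and the names are made up:

```python
import statistics
import timeit

def bench(stmt, setup="pass", repeats=20, number=10_000):
    """Collect per-call timing samples over several repeats, like pyperf's runs."""
    totals = timeit.repeat(stmt, setup, repeat=repeats, number=number)
    samples = [t / number for t in totals]
    return statistics.mean(samples), statistics.stdev(samples)

mean, stdev = bench("x + y", setup="x, y = 2, 3")
print(f"int addition: {mean * 1e9:.1f} ns +- {stdev * 1e9:.1f} ns")
```

pyperf additionally runs the samples in fresh subprocesses and applies warmup and stability heuristics, which is what makes it preferable to hand-rolled timing like this.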

@Maegereg (Contributor, Author)

I went ahead and wrote up a version that uses pyperf.

@oscarbenjamin (Collaborator)

Sorry, this dropped off my radar.

@oscarbenjamin (Collaborator)

It has taken me a while to figure out how to actually run the benchmarks in my dev setup, but it is:

spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH

This is because I am using environment variables to make libflint.so available to the runtime linker but pyperf by default drops environment variables when launching the subprocesses that actually run the benchmarks.

When I run the benchmarks I see these warnings in the output for each case:

$ spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH
.....................
WARNING: the benchmark result may be unstable
* the standard deviation (47.8 ns) is 12% of the mean (389 ns)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.

acb addition: Mean +- std dev: 389 ns +- 48 ns
.....................
WARNING: the benchmark result may be unstable
* Not enough samples to get a stable result (95% certainly of less than 1% variation)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.
...

Is there a way to write the benchmarking code differently so that the results are considered to be more reliable?

The warnings can be suppressed with --quiet, so for now I'll use that, and we have:

$ meson setup build --reconfigure -Dbuildtype=release
$ spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH --quiet
acb addition: Mean +- std dev: 173 ns +- 2 ns
acb contains: Mean +- std dev: 566 ns +- 6 ns
acb multiplication: Mean +- std dev: 165 ns +- 21 ns
arb addition: Mean +- std dev: 138 ns +- 2 ns
arb contains: Mean +- std dev: 1.06 us +- 0.01 us
arb multiplication: Mean +- std dev: 133 ns +- 1 ns
fmpq addition: Mean +- std dev: 184 ns +- 28 ns
fmpq equality: Mean +- std dev: 342 ns +- 6 ns
fmpq multiplication: Mean +- std dev: 208 ns +- 4 ns
fmpz addition: Mean +- std dev: 92.7 ns +- 0.9 ns
fmpz equality: Mean +- std dev: 93.1 ns +- 1.2 ns
fmpz multiplication: Mean +- std dev: 97.4 ns +- 6.0 ns

Then this is using the stable ABI v3.12:

$ meson setup build --reconfigure -Dbuildtype=release -Dpython.allow_limited_api=true -Dlimited_api_version=3.12
$ spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH --quiet
acb addition: Mean +- std dev: 236 ns +- 42 ns
acb contains: Mean +- std dev: 573 ns +- 17 ns
acb multiplication: Mean +- std dev: 197 ns +- 11 ns
arb addition: Mean +- std dev: 171 ns +- 25 ns
arb contains: Mean +- std dev: 1.05 us +- 0.01 us
arb multiplication: Mean +- std dev: 159 ns +- 14 ns
fmpq addition: Mean +- std dev: 231 ns +- 16 ns
fmpq equality: Mean +- std dev: 464 ns +- 4 ns
fmpq multiplication: Mean +- std dev: 265 ns +- 12 ns
fmpz addition: Mean +- std dev: 130 ns +- 9 ns
fmpz equality: Mean +- std dev: 99.8 ns +- 7.4 ns
fmpz multiplication: Mean +- std dev: 141 ns +- 13 ns

(Side note: I needed to do rm -r build-install/ when switching to the stable ABI, because otherwise you end up with both kinds of extension modules and CPython prefers the non-stable-ABI ones at import time.)

This is the stable ABI v3.9:

$ meson setup build --reconfigure -Dbuildtype=release -Dpython.allow_limited_api=true -Dlimited_api_version=3.9
$ spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH --quiet
acb addition: Mean +- std dev: 195 ns +- 8 ns
acb contains: Mean +- std dev: 545 ns +- 10 ns
acb multiplication: Mean +- std dev: 182 ns +- 16 ns
arb addition: Mean +- std dev: 165 ns +- 10 ns
arb contains: Mean +- std dev: 1.05 us +- 0.03 us
arb multiplication: Mean +- std dev: 152 ns +- 7 ns
fmpq addition: Mean +- std dev: 206 ns +- 10 ns
fmpq equality: Mean +- std dev: 451 ns +- 70 ns
fmpq multiplication: Mean +- std dev: 247 ns +- 12 ns
fmpz addition: Mean +- std dev: 120 ns +- 10 ns
fmpz equality: Mean +- std dev: 94.1 ns +- 3.2 ns
fmpz multiplication: Mean +- std dev: 117 ns +- 7 ns

Those timings are all quite noisy. I haven't done a systematic analysis of statistical significance, but it does look like the stable ABI gives an average slowdown for these micro-operations, maybe about 20% slower overall. I don't see a clear difference between using the 3.9 vs 3.12 version of the stable ABI (the Cython docs say that using 3.12 can make some things faster). With something bigger like an arb_mat the difference would probably be less noticeable, but for something like fmpz(2)+fmpz(3) the overheads here are noticeable.

Further investigation could be done, especially rerunning the timings on a different computer, because this is an old and not very powerful machine. Assuming there really is about a 20% slowdown, I think that means we generally don't want to use the stable ABI for all of the wheels uploaded to PyPI. We could however do something hybrid, like using the stable ABI for less common platforms or for older Python versions.

CC @da-woods who may be interested to know about the Cython+stable-ABI timings.

@da-woods

CC @da-woods who may be interested to know about the Cython+stable-ABI timings.

Thanks - the 20% numbers look broadly similar to what we've measured for Cython itself (although we see a little more version dependence).

It doesn't look like any of your benchmarks are especially outliers so I don't think there's anything specific that's performing badly here.

Side note: I'm a little worried that the future Python 3.15 free-threading-compatible Stable ABI will be more expensive and too much of a performance loss for most people. But that's obviously a future problem.

We could however do something hybrid like using the stable ABI for less common platforms

That's what we've done (although mostly as an "eat your own dog food" type of thing rather than because it's really necessary).

@oscarbenjamin (Collaborator)

the 20% numbers look broadly similar to what we've measured for Cython itself (although we see a little more version dependence).

The main difference here is that Cython is all written in Python and compiled by itself, whereas here we are just wrapping a C library. The Cython code just bridges from Python into C and ultimately calls a C function, e.g. this is fmpz.__add__:

def __add__(s, t):
    cdef fmpz_struct tval[1]
    cdef int ttype = FMPZ_UNKNOWN
    u = NotImplemented
    ttype = fmpz_set_any_ref(tval, t)
    if ttype != FMPZ_UNKNOWN:
        u = fmpz.__new__(fmpz)
        fmpz_add((<fmpz>u).val, (<fmpz>s).val, tval)
    if ttype == FMPZ_TMP:
        fmpz_clear(tval)
    return u

So we're just measuring all of the overhead of checking the type of the input and allocating memory for the output before we get to calling the fmpz_add C function. It isn't clear to me exactly what the limited API does that would make that particular method 40% slower, as indicated in the timings.

In gh-35 I added methods like __radd__ because I thought that Cython 3 needed them, but in retrospect I think that this maybe adds overhead compared to using c_api_binop_methods=True:

https://cython.readthedocs.io/en/latest/src/userguide/special_methods.html#arithmetic-methods

I assume that is at least partly responsible for this being slower than gmpy2, which uses C directly rather than Cython (I think int is faster because it caches small ints):

In [4]: ai, bi = 2, 3

In [5]: ag, bg = gmpy2.mpz(2), gmpy2.mpz(3)

In [6]: af, bf = flint.fmpz(2), flint.fmpz(3)

In [7]: %timeit ai+bi
35.2 ns ± 0.398 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [8]: %timeit ag+bg
85.9 ns ± 0.209 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [9]: %timeit af+bf
142 ns ± 0.363 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
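The small-int caching mentioned above is easy to observe directly: CPython preallocates the integers -5 through 256 and hands out the same object each time (a CPython implementation detail, not a language guarantee). A quick illustration:

```python
# Construct the values at runtime so the compiler cannot fold them
# into a single shared constant.
a, b = int("100"), int("100")
big_a, big_b = int("10000"), int("10000")

print(a is b)          # True: both names refer to the one cached 100 object
print(big_a is big_b)  # False: 10000 is outside the cache, so two objects
```

For cached values there is no allocation at all on addition results that land in the cache, which is part of why plain int wins these micro-benchmarks.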

I'm a little worried that the future Python 3.15 free-threading-compatible Stable ABI will be more expensive and too much of a performance loss for most people.

I think that a lot of Python libraries that wrap a C library tend to wrap larger operations, like multiplying large arrays. In python-flint's case some of the important objects are really small, so we're really trying to call a tiny C function and the runtime is dominated by the overhead around that C call. I suspect that most other libraries are either calling something more expensive in C, or performing an operation that is not likely to be done millions of times in a loop. In that case the overheads we are talking about here don't matter.
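The point about per-call overhead dominating tiny operations can be illustrated with plain Python ints: the same fixed wrapper cost that dwarfs a machine-word addition is negligible next to a multi-thousand-digit multiplication. This is a rough stdlib sketch of that contrast, not python-flint code:

```python
import timeit

# Per-call cost of a tiny operation: dominated by interpreter/call overhead.
small_per_call = timeit.timeit("a + b", globals={"a": 2, "b": 3},
                               number=100_000) / 100_000

# Per-call cost of a genuinely expensive operation: the same fixed
# overhead is a rounding error here.
big = 10 ** 5000
big_per_call = timeit.timeit("a * a", globals={"a": big},
                             number=1_000) / 1_000

print(f"small add: {small_per_call * 1e9:.0f} ns/call, "
      f"big multiply: {big_per_call * 1e9:.0f} ns/call")
```

Adding, say, 30 ns of ABI overhead roughly doubles the first number and leaves the second essentially unchanged.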

@da-woods

My first guess would be that the difference is in the construction/destruction of the extension types (e.g. u = fmpz.__new__(fmpz)). I'd need to have a proper look to be certain though. I'll try to have a look in the next few days and see if there's anything obvious.

@oscarbenjamin (Collaborator)

I tried rerunning the timings just to be a bit more confident. I was a bit more disciplined about not using the computer while the benchmarks were running, and the standard deviations are smaller this time.

This is the normal build:

$  meson setup build --reconfigure -Dbuildtype=release
$  spin test
$  spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH --quiet
acb addition: Mean +- std dev: 174 ns +- 6 ns
acb contains: Mean +- std dev: 566 ns +- 8 ns
acb multiplication: Mean +- std dev: 161 ns +- 5 ns
arb addition: Mean +- std dev: 141 ns +- 6 ns
arb contains: Mean +- std dev: 1.08 us +- 0.07 us
arb multiplication: Mean +- std dev: 133 ns +- 4 ns
fmpq addition: Mean +- std dev: 174 ns +- 2 ns
fmpq equality: Mean +- std dev: 342 ns +- 5 ns
fmpq multiplication: Mean +- std dev: 207 ns +- 3 ns
fmpz addition: Mean +- std dev: 94.0 ns +- 4.9 ns
fmpz equality: Mean +- std dev: 93.3 ns +- 1.3 ns
fmpz multiplication: Mean +- std dev: 96.6 ns +- 2.3 ns

This is stable ABI v3.12:

$  meson setup build --reconfigure -Dbuildtype=release -Dpython.allow_limited_api=true -Dlimited_api_version=3.12
$  spin test
$  spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH --quiet
acb addition: Mean +- std dev: 238 ns +- 37 ns
acb contains: Mean +- std dev: 563 ns +- 3 ns
acb multiplication: Mean +- std dev: 215 ns +- 30 ns
arb addition: Mean +- std dev: 169 ns +- 25 ns
arb contains: Mean +- std dev: 1.05 us +- 0.01 us
arb multiplication: Mean +- std dev: 156 ns +- 3 ns
fmpq addition: Mean +- std dev: 241 ns +- 25 ns
fmpq equality: Mean +- std dev: 467 ns +- 13 ns
fmpq multiplication: Mean +- std dev: 270 ns +- 12 ns
fmpz addition: Mean +- std dev: 125 ns +- 1 ns
fmpz equality: Mean +- std dev: 95.2 ns +- 0.7 ns
fmpz multiplication: Mean +- std dev: 130 ns +- 10 ns

My first guess would be that the difference is in the construction/destruction of the extension types (e.g. u = fmpz.__new__(fmpz))

That seems right to me. With these new timings it is a bit clearer which things are slower and which are not affected. Methods like contains and equality are mostly unchanged. These are methods that return True or False. Methods that create new extension objects (addition and multiplication) are consistently slower with the stable ABI.

The outlier is the "fmpq equality" benchmark that seems to be consistently slower in the stable ABI even though it only returns a bool. The code for that one is here:

def __richcmp__(s, t, int op):
    cdef bint res
    s = any_as_fmpq(s)
    if s is NotImplemented:
        return s
    t = any_as_fmpq(t)
    if t is NotImplemented:
        return t
    if op == 2 or op == 3:
        res = fmpq_equal((<fmpq>s).val, (<fmpq>t).val)
        if op == 3:
            res = not res
        return res

My guess is that any_as_fmpq slows down for some reason:

cdef int fmpq_set_any_ref(fmpq_t x, obj):
    cdef int status
    fmpq_init(x)
    if typecheck(obj, fmpq):
        x[0] = (<fmpq>obj).val[0]
        return FMPZ_REF
    if typecheck(obj, fmpz):
        fmpz_set(fmpq_numref(x), (<fmpz>obj).val)
        fmpz_one(fmpq_denref(x))
        return FMPZ_TMP
    status = fmpz_set_any_ref(fmpq_numref(x), obj)
    if status != FMPZ_UNKNOWN:
        fmpz_one(fmpq_denref(x))
        return FMPZ_TMP
    fmpq_clear(x)
    return FMPZ_UNKNOWN

cdef any_as_fmpq(obj):
    cdef fmpq_t x
    cdef int status
    cdef fmpq q
    status = fmpq_set_any_ref(x, obj)
    if status == FMPZ_REF:
        q = fmpq.__new__(fmpq)
        fmpq_set(q.val, x)
        return q
    elif status == FMPZ_TMP:
        q = fmpq.__new__(fmpq)
        fmpq_clear(q.val)
        q.val[0] = x[0]
        return q
    else:
        return NotImplemented

Note that in the benchmarks both operands are of the same type so we should be taking all the fast paths.

But wait there is a bug! It is calling fmpq.__new__(fmpq) even when the input was already an fmpq (status == FMPZ_REF).

We can fix that:

diff --git a/src/flint/types/fmpq.pyx b/src/flint/types/fmpq.pyx
index ef4fdb5..5cf06a3 100644
--- a/src/flint/types/fmpq.pyx
+++ b/src/flint/types/fmpq.pyx
@@ -19,7 +19,6 @@ cdef int fmpq_set_any_ref(fmpq_t x, obj):
     cdef int status
     fmpq_init(x)
     if typecheck(obj, fmpq):
-        x[0] = (<fmpq>obj).val[0]
         return FMPZ_REF
     if typecheck(obj, fmpz):
         fmpz_set(fmpq_numref(x), (<fmpz>obj).val)
@@ -38,9 +37,7 @@ cdef any_as_fmpq(obj):
     cdef fmpq q
     status = fmpq_set_any_ref(x, obj)
     if status == FMPZ_REF:
-        q = fmpq.__new__(fmpq)
-        fmpq_set(q.val, x)
-        return q
+        return obj
     elif status == FMPZ_TMP:
         q = fmpq.__new__(fmpq)
         fmpq_clear(q.val)

All tests pass, and these are the new stable ABI v3.12 timings:

$  meson setup build --reconfigure -Dbuildtype=release -Dpython.allow_limited_api=true -Dlimited_api_version=3.12
$  spin test
$  spin run python benchmarks/simple_benchmarks.py --inherit-environ=PYTHONPATH,LD_LIBRARY_PATH --quiet
acb addition: Mean +- std dev: 253 ns +- 48 ns
acb contains: Mean +- std dev: 565 ns +- 6 ns
acb multiplication: Mean +- std dev: 205 ns +- 21 ns
arb addition: Mean +- std dev: 169 ns +- 24 ns
arb contains: Mean +- std dev: 1.05 us +- 0.01 us
arb multiplication: Mean +- std dev: 155 ns +- 2 ns
fmpq addition: Mean +- std dev: 158 ns +- 18 ns
fmpq equality: Mean +- std dev: 153 ns +- 1 ns
fmpq multiplication: Mean +- std dev: 181 ns +- 1 ns
fmpz addition: Mean +- std dev: 126 ns +- 1 ns
fmpz equality: Mean +- std dev: 97.0 ns +- 4.0 ns
fmpz multiplication: Mean +- std dev: 135 ns +- 8 ns

Now fmpq equality is much faster to the point that it doesn't even make sense to compare it with the previous non-stable-ABI timings. That is possibly not the correct fix but something like it could be used.
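The shape of the fix (returning the object itself when it is already the right type, instead of allocating a copy through __new__) can be shown with a pure-Python analogue, using fractions.Fraction as a stand-in for fmpq; the function name here is illustrative:

```python
from fractions import Fraction

def any_as_fraction(obj):
    """Coerce obj to Fraction, mirroring the fixed any_as_fmpq fast path."""
    if type(obj) is Fraction:
        return obj              # fast path: no new object is allocated
    if isinstance(obj, int):
        return Fraction(obj)    # cheap exact conversion
    return NotImplemented       # let the other operand's method handle it

q = Fraction(1, 2)
print(any_as_fraction(q) is q)   # True: identity preserved, no allocation
print(any_as_fraction("x"))      # NotImplemented
```

Since fmpq is immutable from Python's point of view, handing back the same object should be safe, which is why the fast path works at all.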

In any case, all of the observed timings are now consistent with the hypothesis that the slowdown we observe is just T.__new__(T) being slower, and that if that were somehow solved we might not see any slowdown at all in these benchmarks.
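The T.__new__(T) hypothesis can be isolated with a stdlib micro-benchmark that times bare object allocation for a small slotted class, which is roughly the work fmpz.__new__(fmpz) does before __cinit__ runs. This is plain Python rather than a Cython extension type, so absolute numbers will differ, but it shows the measurement one would make:

```python
import timeit

class Box:
    """Stand-in for a small extension type like fmpz."""
    __slots__ = ("val",)

# Time bare allocation via tp_new, bypassing __init__ entirely.
alloc = timeit.timeit("Box.__new__(Box)", globals={"Box": Box},
                      number=100_000) / 100_000

# For comparison: an operation that allocates nothing at all.
noop = timeit.timeit("x is y", globals={"x": 1, "y": 2},
                     number=100_000) / 100_000

print(f"allocation: {alloc * 1e9:.0f} ns/call vs no-op: {noop * 1e9:.0f} ns/call")
```

Running the same snippet under a normal build and a stable-ABI build of an extension type would show whether the allocation path alone accounts for the difference.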

@da-woods

I did a bit of investigating of T.__new__(T). There's definitely a difference in both allocation and deallocation in Stable ABI mode. I didn't find anything useful to change to make it better though unfortunately.

@oscarbenjamin (Collaborator)

Thanks for looking into it.

Do you know if the difference is to do with Cython not using some CPython internals or it is just that defining Py_LIMITED_API makes some particular C API call slower on CPython's side?

I assume that we are still just calling tp_new so I'm not sure how that slows down. This is the code that is generated for tp_new:

static PyObject *__pyx_tp_new_5flint_5types_4fmpz_fmpz(PyTypeObject *t, PyObject *a, PyObject *k) {
  PyObject *o = __Pyx_PyType_GetSlot(__pyx_mstate_global->__pyx_ptype_5flint_10flint_base_10flint_base_flint_scalar, tp_new, newfunc)(t, a, k);
  if (unlikely(!o)) return 0;
  if (unlikely(__pyx_pw_5flint_5types_4fmpz_4fmpz_1__cinit__(o, __pyx_mstate_global->__pyx_empty_tuple, NULL) < 0)) goto bad;
  return o;
  bad:
  Py_DECREF(o); o = 0;
  return NULL;
}

static int __pyx_pw_5flint_5types_4fmpz_4fmpz_1__cinit__(PyObject *__pyx_v_self, PyObject *__pyx_args, PyObject *__pyx_kwds) {
  CYTHON_UNUSED Py_ssize_t __pyx_nargs;
  CYTHON_UNUSED PyObject *const *__pyx_kwvalues;
  int __pyx_r;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("__cinit__ (wrapper)", 0);
  #if CYTHON_ASSUME_SAFE_SIZE
  __pyx_nargs = PyTuple_GET_SIZE(__pyx_args);
  #else
  __pyx_nargs = PyTuple_Size(__pyx_args); if (unlikely(__pyx_nargs < 0)) return -1;
  #endif
  __pyx_kwvalues = __Pyx_KwValues_VARARGS(__pyx_args, __pyx_nargs);
  if (unlikely(__pyx_nargs > 0)) { __Pyx_RaiseArgtupleInvalid("__cinit__", 1, 0, 0, __pyx_nargs); return -1; }
  const Py_ssize_t __pyx_kwds_len = unlikely(__pyx_kwds) ? __Pyx_NumKwargs_VARARGS(__pyx_kwds) : 0;
  if (unlikely(__pyx_kwds_len < 0)) return -1;
  if (unlikely(__pyx_kwds_len > 0)) {__Pyx_RejectKeywords("__cinit__", __pyx_kwds); return -1;}
  __pyx_r = __pyx_pf_5flint_5types_4fmpz_4fmpz___cinit__(((struct __pyx_obj_5flint_5types_4fmpz_fmpz *)__pyx_v_self));

  /* function exit code */
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static int __pyx_pf_5flint_5types_4fmpz_4fmpz___cinit__(struct __pyx_obj_5flint_5types_4fmpz_fmpz *__pyx_v_self) {
  int __pyx_r;

  /* "flint/types/fmpz.pyx":68
 * 
 *     def __cinit__(self):
 *         fmpz_init(self.val)             # <<<<<<<<<<<<<<
 * 
 *     def __dealloc__(self):
*/
  fmpz_init(__pyx_v_self->val);

  /* "flint/types/fmpz.pyx":67
 *     """
 * 
 *     def __cinit__(self):             # <<<<<<<<<<<<<<
 *         fmpz_init(self.val)
 * 
*/

  /* function exit code */
  __pyx_r = 0;
  return __pyx_r;
}

All of that comes from:

def __cinit__(self):
    fmpz_init(self.val)

@da-woods

Do you know if the difference is to do with Cython not using some CPython internals or it is just that defining Py_LIMITED_API makes some particular C API call slower on CPython's side?

I tried just setting CYTHON_LIMITED_API rather than Py_LIMITED_API. That switches Cython's code-generation to Limited API mode, but doesn't affect the Python headers (so you get all the inlining/macros). (Note that this is not a useful build mode for users but occasionally useful for testing things like this).

The upshot was that it performs about the same as the "true" Limited API. So it isn't "that defining Py_LIMITED_API makes some particular C API call slower on CPython's side".


I was a bit suspicious of PyType_GetSlot as a level of indirection. I thought I'd identified some ways of replacing it (in a lot of cases we know exactly what the result is going to be), but they didn't help. Admittedly I don't think they'd have applied to your code though.

I also had a bit of a look at profiling it at C/assembly level with perf and there wasn't anything that dramatically stood out.

@oscarbenjamin (Collaborator)

Thanks for looking into it and I see you have now opened cython/cython#7482 about this.

For python-flint specifically, I'm now wondering if it would make sense to add a cdef function as a replacement for fmpz.__new__(fmpz), since I see now that Cython generates somewhat complicated code for that operation (although I assume some of that gets preprocessed out).

It might be that we can make this faster both with and without the stable ABI and that the differences observed here would also go away.

I tried this, which seems to pass the tests, although I guess it should check for NULL:

diff --git a/src/flint/types/fmpz.pxd b/src/flint/types/fmpz.pxd
index f48b059..2f5291f 100644
--- a/src/flint/types/fmpz.pxd
+++ b/src/flint/types/fmpz.pxd
@@ -1,9 +1,10 @@
 from cpython.long cimport PyLong_Check
+from cpython.type cimport PyType_GenericAlloc
 from flint.flint_base.flint_base cimport flint_scalar
 from flint.utils.conversion cimport chars_from_str
 from flint.flintlib.types.flint cimport slong, pylong_as_slong
 from flint.flintlib.types.flint cimport PyObject
-from flint.flintlib.functions.fmpz cimport fmpz_t, fmpz_set_str, fmpz_set_si
+from flint.flintlib.functions.fmpz cimport fmpz_t, fmpz_set_str, fmpz_set_si, fmpz_init
 
 cdef int fmpz_set_any_ref(fmpz_t x, obj)
 cdef fmpz_get_intlong(fmpz_t x)
@@ -34,3 +35,8 @@ cdef class fmpz(flint_scalar):
 
     """
     cdef fmpz_t val
+
+cdef inline object _new_fmpz():
+    cdef fmpz res = PyType_GenericAlloc(fmpz, 0)
+    fmpz_init(res.val)
+    return res
diff --git a/src/flint/types/fmpz.pyx b/src/flint/types/fmpz.pyx
index 6f3cf67..617df7a 100644
--- a/src/flint/types/fmpz.pyx
+++ b/src/flint/types/fmpz.pyx
@@ -196,7 +196,7 @@ cdef class fmpz(flint_scalar):
         u = NotImplemented
         ttype = fmpz_set_any_ref(tval, t)
         if ttype != FMPZ_UNKNOWN:
-            u = fmpz.__new__(fmpz)
+            u = _new_fmpz()
             fmpz_add((<fmpz>u).val, (<fmpz>s).val, tval)
         if ttype == FMPZ_TMP:
             fmpz_clear(tval)

I'm still seeing a significant timing difference with stable vs non-stable ABI though:

In [3]: %timeit af+bf  # stable ABI v3.12
132 ns ± 0.354 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [3]: %timeit af+bf # not stable ABI
107 ns ± 0.836 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

It does look like Cython adds a bunch of stuff in the cdef inline function, so maybe this needs to just go down to C and use some cdef extern from * to make a macro.

@da-woods

Your small inline function looks like it might be a good option for you.

I think the destruction may be slower as well as the construction; unfortunately you really can't remove that in the same way.
