We currently specialize TO_BOOL for many common types.
This avoids the overhead of API calls, but we still need to load either True or False, then test against True or False.
The additional cost of having to load and compare with Py_True and Py_False is high for what are often quite simple operations. E.g. _TO_BOOL_LIST is 10 instructions (AArch64 Linux), but only half of those perform the comparison.
We can break down _TO_BOOL_FOO into _TO_BOOL_BIT_FOO; _BIT_TO_BOOL
and then optimize _BIT_TO_BOOL; _GUARD_IS_TRUE_POP to _GUARD_IS_TRUE_BIT_POP,
where the "bit" versions produce a single-bit boolean (0 for False, 1 for True).
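The decomposition can be modelled with a minimal C sketch. This is not the actual uop code: Obj, List, true_obj, false_obj and all function names are hypothetical stand-ins, used only to show where the load of and compare against the singleton disappear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the Py_True / Py_False singletons. */
typedef struct { int unused; } Obj;
static Obj true_obj, false_obj;

/* Hypothetical list model: only the size matters for truthiness. */
typedef struct { ptrdiff_t size; } List;

/* Object form (models _TO_BOOL_LIST): materialize a boolean object... */
static Obj *to_bool_list(const List *list) {
    return list->size != 0 ? &true_obj : &false_obj;
}

/* ...which the guard (models _GUARD_IS_TRUE_POP) must compare
 * against a singleton. */
static bool guard_is_true(const Obj *value) {
    return value == &true_obj;
}

/* Bit form (models _TO_BOOL_BIT_LIST): produce 0 or 1 directly. */
static unsigned to_bool_bit_list(const List *list) {
    return list->size != 0;
}

/* Models _BIT_TO_BOOL: convert a bit back to an object when one is
 * really needed. */
static Obj *bit_to_bool(unsigned bit) {
    return bit ? &true_obj : &false_obj;
}

/* Models _GUARD_IS_TRUE_BIT_POP: test the bit, with no singleton
 * load or compare. */
static bool guard_is_true_bit(unsigned bit) {
    return bit != 0;
}

/* Both paths must agree for any list. */
static void check_equivalent(const List *list) {
    assert(bit_to_bool(to_bool_bit_list(list)) == to_bool_list(list));
    assert(guard_is_true_bit(to_bool_bit_list(list)) ==
           guard_is_true(to_bool_list(list)));
}
```

In this model, to_bool_list is the same as to_bool_bit_list followed by bit_to_bool, and when a guard follows, the bit_to_bool and guard_is_true pair collapses into guard_is_true_bit.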
Whereas _TO_BOOL_LIST is 10 instructions, a hypothetical _TO_BOOL_BIT_LIST would only be 5 instructions.
We already optimize _GUARD_IS_TRUE_POP to _GUARD_BIT_IS_SET_POP, reducing the number of machine instructions from 5 to 2, but replacing it with _GUARD_IS_TRUE_BIT_POP would reduce it to a single machine instruction and remove the need for the replication in _GUARD_BIT_IS_SET_POP.
We can also replace many of the comparisons with a "bit" form, e.g. replacing _COMPARE_OP_FLOAT with _COMPARE_OP_BIT_FLOAT would reduce the code size from 19 to 13 instructions (21 to 14 accounting for the following guard as well).
[ Specializing for the actual operation can further reduce the stencil size to 8 instructions ]
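The same idea applied to float comparison might look like the sketch below. It is a simplified model, not the real uop: the CmpOp enum, Obj and all function names are hypothetical, and the switch stands in for whatever dispatch on the operation the generic uop performs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the Py_True / Py_False singletons. */
typedef struct { int unused; } Obj;
static Obj true_obj, false_obj;

/* Hypothetical operation codes; the real encoding is not modelled. */
typedef enum { CMP_LT, CMP_EQ, CMP_GT } CmpOp;

/* Bit form (models _COMPARE_OP_BIT_FLOAT): returns 0 or 1 and still
 * dispatches on the operation at run time. */
static unsigned compare_op_bit_float(double a, double b, CmpOp op) {
    switch (op) {
    case CMP_LT: return a < b;
    case CMP_EQ: return a == b;
    case CMP_GT: return a > b;
    }
    return 0;
}

/* Object form (models _COMPARE_OP_FLOAT): the same comparison plus
 * the selection of a boolean object. */
static Obj *compare_op_float(double a, double b, CmpOp op) {
    return compare_op_bit_float(a, b, op) ? &true_obj : &false_obj;
}

/* Specialized for one operation: no dispatch is left at all. */
static unsigned compare_lt_bit_float(double a, double b) {
    return a < b;
}

/* A following guard only needs to test the bit. */
static bool guard_is_true_bit(unsigned bit) {
    return bit != 0;
}
```

The object form pays for the boolean object selection on top of the comparison, and the per-operation variant also drops the dispatch, which is where the further size reduction would be expected to come from.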
All instruction sizes are for the variant with all inputs and outputs in registers.