Unified integer overflow arithmetic

Document Number: --
Authors: Tiago Freire
Audience: SG6, LWG, WG21




This paper serves as a point of discussion to help generate a path toward the unification and standardization of integer arithmetic extensions with defined overflow behaviour. It is not a proposal draft.


Revision history

  1. Initial draft
  2. Corrected some spelling issues and integrated community feedback, removing concepts from the discussion


In many applications one has to deal with integer arithmetic on numbers whose width far exceeds what the standard-defined types can support, or ever will be able to support, since the width of the representable numbers is application specific and can grow arbitrarily large (within the finite capabilities of the underlying device).

Algorithms to deal with these are quite trivial, and most CPUs offer a good range of support for the required instructions, but no equivalent abstractions are available in standard C++. It is rather cumbersome and error prone to implement similar functionality using only C++, resulting in extremely inefficient code for what could often be a couple of lines, or even a single line, of assembly.

Papers such as [P3018] have been proposed introducing some of the missing functionality, and [P0543] has already been accepted and is on track for C++26. However, the problem is that these papers take completely different approaches and styles that are not conducive to writing a standard that is consistent, concise, and practical to use.

I mostly agree with the motivation of [P3018] and its background (so those will not be addressed in this paper), but I disagree with its execution.

I only wish to add to the motivation in light of my experience of having implemented and extensively used some of these operations (in particular the unsigned versions of the wide arithmetic family described below):

  1. Some Big number operations require a combination of all of the "wide arithmetic" versions for a specific type.
  2. CPUs such as x86-64 require certain operands to be preloaded into specific registers, which is typically achieved with mov instructions for both the preparation of the operands and the recovery of the result.
  3. In many of these cases the output register happens to be exactly the register where the value needs to be for a follow-up operation, making a mov instruction unnecessary.
An optimizer aware of this, which is free to re-order independent operations and freely swap commutative operands (as in addition and multiplication), can maximize the suppression of mov instructions. This is best performed at the compiler level, which has a holistic view of the structure and context of the code being generated; a library writer, who only has a view of the function being written, cannot do this.
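As an illustration of point 1, Big number addition is built by chaining the proposed add_carry. The sketch below is a portable reference implementation, not an existing API; the name add_carry and the {value, carry} return shape follow this paper's tables.

```cpp
#include <cstdint>
#include <tuple>

// Portable sketch of the proposed unsigned add_carry (hypothetical name
// and signature): returns {sum, carry_out} of a + b + carry_in.
constexpr std::tuple<std::uint64_t, bool>
add_carry(std::uint64_t a, std::uint64_t b, bool carry_in) noexcept
{
    std::uint64_t s = a + b;
    bool c1 = s < a;                 // did a + b wrap around?
    std::uint64_t r = s + carry_in;
    bool c2 = r < s;                 // did adding the carry wrap around?
    return {r, c1 || c2};            // at most one of c1/c2 can be set
}

// 256-bit addition as a chain of four add_carry operations.
// On x86-64 this maps naturally onto one add followed by three adc
// instructions, which is exactly the mov-suppression opportunity
// described above.
bool add256(std::uint64_t r[4], const std::uint64_t a[4], const std::uint64_t b[4])
{
    bool c = false;
    std::tie(r[0], c) = add_carry(a[0], b[0], c);
    std::tie(r[1], c) = add_carry(a[1], b[1], c);
    std::tie(r[2], c) = add_carry(a[2], b[2], c);
    std::tie(r[3], c) = add_carry(a[3], b[3], c);
    return c; // final carry: the 257th bit
}
```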

The logical relationship between the extended integer arithmetic under discussion can best be summarized with the following table:

Type of operation   | Standard    | Saturated     | Safe overflow                                | Wide arithmetic
Addition            | +           | add_sat       | add_overflow                                 | add_carry
Subtraction         | -           | sub_sat       | sub_overflow                                 | sub_borrow
Multiplication      | *           | mul_sat       | mul_overflow                                 | mul_wide
Division, Remainder | /, %        | div_sat       | div_overflow, div_rem_overflow, rem_overflow | div_wide
Type casting        | static_cast | saturate_cast | overflow_cast?                               | N/A

All of these operations relate to how the integer operations (+, -, *, /) behave in overflow conditions. Each family has its own use case and slightly tweaks the overflow behaviour, and as such one should expect consistency in naming conventions, parameter types, return style, library header, etc.

The problem with division

Most operations in both of the mentioned papers are well defined and uncontroversial regarding their use case, except for division (in the saturated and overflow families).

  1. In the specification of "div_sat" the author proposes that division by 0 be undefined behaviour. I agree with this: division by 0 is an undefined mathematical operation, and giving it a "saturated" defined value would have been incorrect; it is the responsibility of the user to check for that case and handle it within the context of what they are trying to do. With the 0 divisor case thus excluded, for the strictly unsigned integer types "div_sat" would never express its saturated behaviour, making it pointless. The only scenario where the saturated behaviour would be expressed is the degenerate case (INT_MIN / -1) and nowhere else. Given that:
    1. The degenerate case is trivial to check
    2. The user already has the responsibility for checking division by 0
    It wouldn't be unreasonable to also expect the user to check for the degenerate case, and to completely remove "div_sat" from the specification. It doesn't hurt to keep it, but there's also no need to add it for completeness.
  2. The author of the "div_overflow" proposal, on the other hand, proposes that both division by 0 and the degenerate case have defined behaviour and signal an "overflow".
    I strongly disagree with this: division by 0 does not result in an "overflow", it is mathematically undefined. A numerical operation that has a mathematically valid result which cannot fit in the specified datatype is fundamentally different from an operation that is mathematically invalid. Thus an implementation of "div_overflow" should also leave division by 0 undefined. But if division by 0 is undefined behaviour then, by the exact same logic as for "div_sat", "div_overflow" is equally unnecessary.
  3. "rem_overflow", regardless of signed types, degenerate cases, or even whether we consider division by 0 as defined behaviour, never overflows, so it should be removed.
  4. As for "div_rem_overflow": if "div_overflow" can be removed and "rem_overflow" never overflows (and should be removed), "div_rem_overflow" should follow the same fate.
  5. What is useful but isn't mentioned is a fused "div_rem" (division and remainder) operation without expressed overflow behaviour. Many compiler optimizers already behave (in most cases) as if "div_rem" existed when they see both a trivial division (/) and a remainder (%) performed on the same integer operands. A "div_rem" would provide this behaviour explicitly and in a more structured way, without having to rely on the compiler to optimize it implicitly behind the curtain.
  6. For completeness, "div_wide" is not only useful but necessary. It is an integral part of algorithms performing divisions of Big numbers, which can be achieved by chaining multiple "div_wide" operations together, where the remainder of a partial division of high-order bits is used as the high-bits input of the next "div_wide" instruction in order to complete a full division.
    "div_wide" can overflow when dividend_high >= divisor (note that division by 0 is the special case where dividend_high >= 0, which is always true), and in particular on x86-64 this instruction traps when that condition is met. No "high" output bits are specified, nor should they be; the point of this type of algorithm is to be used as a reciprocal to mul_wide, where the high bits, if not 0, are typically the remainder of a previous division by the same divisor (which is guaranteed to be smaller than the divisor). For that reason, when dividend_high >= divisor the behaviour should be undefined.
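The chaining described in point 6 can be sketched portably for 32-bit words using 64-bit arithmetic. The name div_wide is this paper's proposal; the multi-word helper below is a hypothetical illustration of the chaining, not a proposed API.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Portable sketch of div_wide for 32-bit words: divides the 64-bit
// value (high:low) by divisor, returning {quotient, remainder}.
// Precondition high < divisor mirrors the undefined-behaviour
// condition discussed above (it also rules out division by 0).
constexpr std::tuple<std::uint32_t, std::uint32_t>
div_wide(std::uint32_t high, std::uint32_t low, std::uint32_t divisor)
{
    assert(high < divisor);
    std::uint64_t n = (std::uint64_t(high) << 32) | low;
    return {std::uint32_t(n / divisor), std::uint32_t(n % divisor)};
}

// Dividing a multi-word Big number (little-endian word order) by a
// single word: the remainder of each partial division feeds the high
// word of the next, so the precondition rem < divisor always holds.
std::uint32_t div_bignum_by_word(std::uint32_t* words, int n, std::uint32_t divisor)
{
    std::uint32_t rem = 0;
    for (int i = n - 1; i >= 0; --i)
        std::tie(words[i], rem) = div_wide(rem, words[i], divisor);
    return rem; // remainder of the full division
}
```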

Additional instruction

There is however an additional property of integer multiplication that is quite useful, for example to reduce complexity in Big number multiplication.
When multiplying two integers, the result fits in an integer of double the width with enough room left to add at least 2 additional integers without overflowing.
This allows performing an integer fused multiply-add while only needing to output 2 integers for the high and low bits. Thus I would also like to propose the addition of mul_add_wide to the standard.
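The "room for 2 extra integers" claim follows from (2^N - 1)^2 + 2(2^N - 1) = 2^(2N) - 1. A portable sketch for 32-bit words (the name mul_add_wide is the proposal here; the exact signature, including the number of addends, is an assumption):

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical mul_add_wide for 32-bit words: a*b + c + d always fits
// in 64 bits, since (2^32-1)^2 + 2*(2^32-1) == 2^64 - 1.
// Returns {low_bits, high_bits}, low bits first.
constexpr std::tuple<std::uint32_t, std::uint32_t>
mul_add_wide(std::uint32_t a, std::uint32_t b, std::uint32_t c, std::uint32_t d) noexcept
{
    std::uint64_t r = std::uint64_t(a) * b + c + d; // cannot overflow 64 bits
    return {std::uint32_t(r), std::uint32_t(r >> 32)};
}
```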

Within the overflow family, when casting a wider type to a narrower type the cast can discard bits, leading to an "overflow". While a cast itself with an optional overflow bit would be awkward to use (given that it would require 2 outputs, the cast value and the overflow bit), asking the question "if I static_cast this value between these 2 types, would I lose information?" is a perfectly valid thing to do.

The final family table should look like this:

Type of operation   | Standard      | Saturated     | Safe overflow                    | Wide arithmetic
Addition            | +             | add_sat       | add_overflow                     | add_carry
Subtraction         | -             | sub_sat       | sub_overflow                     | sub_borrow
Multiplication      | *             | mul_sat       | mul_overflow                     | mul_wide, mul_add_wide
Division, Remainder | /, %, div_rem | div_sat*      | div_overflow*, div_rem_overflow* | div_wide
Type casting        | static_cast   | saturate_cast | would_cast_overflow              | N/A

* may be removed; doesn't hurt to have, but also not worth keeping

Library header

I agree with the approach presented by [P0543] which adds the new functionality to the <numeric> library.

These are functions that perform numerical operations; a new library is not required.

Overloads, Function Templates, or Named Functions

[P0543] proposes function templates while [P3018] proposes specifically named functions. I happen to agree that the use of template arguments is preferable over the alternatives.

Using function templates is just better, as in many circumstances it gives you the option to have your cake and eat it too.

However, there is an elephant in the room that one needs to acknowledge.

The standard and most platforms define 4 different integer widths (8-bit, 16-bit, 32-bit, and 64-bit), with signed and unsigned versions of each (for a total of 8), but there are 10 distinct fundamental integer types (signed char, unsigned char, short, unsigned short, int, unsigned int, long, unsigned long, long long, unsigned long long).
The observant among you will quickly notice that 10 ≠ 8. The way this is dealt with is that one of the fundamental types is chosen to have exactly the same width as another (of all of those, the char types cannot be the aliased ones). And this leaves us with the sad scenario where we can have T1 and T2 where:

is_integral_v<T1> && is_integral_v<T2>
&& !is_same_v<remove_cv_t<T1>, bool> && !is_same_v<remove_cv_t<T2>, bool>
&& (is_signed_v<T1> == is_signed_v<T2>)
&& (sizeof(T1) == sizeof(T2))
&& !is_same_v<remove_cv_t<T1>, remove_cv_t<T2>>

evaluates to true. If two integer types have the same width, the same signedness, and behave in exactly the same way in all respects, in what capacity are they different?
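On a typical LP64 platform (e.g. Linux x86-64) the predicate above is satisfied by unsigned long and unsigned long long; this is a platform-dependent observation, not a guarantee of the standard, but some such pair always exists:

```cpp
#include <type_traits>

// On LP64, unsigned long and unsigned long long are both 64-bit
// unsigned types, yet remain distinct for overload resolution and
// name mangling. The chosen pair differs per platform.
using T1 = unsigned long;
using T2 = unsigned long long;

static_assert(std::is_integral_v<T1> && std::is_integral_v<T2>);
static_assert(std::is_signed_v<T1> == std::is_signed_v<T2>);
static_assert(!std::is_same_v<T1, T2>); // distinct regardless of width
// static_assert(sizeof(T1) == sizeof(T2)); // holds on LP64, not universally
```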

This often leads to library implementers having to acknowledge this problem, providing 2 more implementations than they should have to, where one of those implementations just aliases one type to another. Can we agree to get rid of this?

One might naively think that a solution would be to simply make (uint8_t, uint16_t, uint32_t, uint64_t, int8_t, int16_t, int32_t, int64_t) the fundamental types, and alias all other integer types to one of these for backwards compatibility. And although I agree with this in spirit (and it should at some point happen), it is not an easy thing to accommodate due to the name mangling rules. Yes, somebody needs to fix this, there are ways to do it, it is not easy, but it is not the point of this paper.

The point is that the set of 8 is currently an alias to a selection from the set of 10, and the full set of 10 should be supported, not just the 8.
This is easily done with either overloads or template parameters, but extremely messy if distinct function names are used instead.

TL;DR: Template parameters should be used.

Extra classifications

All of these functions have "well" defined behaviour, and they can all be made constexpr and noexcept. There's a clear benefit to computing things at compile time when possible, and if one can optimize assuming that they don't throw exceptions, why shouldn't they be?

Multiple return or output parameters

For the "saturation" family, all functions have exactly 1 output value, so these are uncontroversially output by return. However, this is not the case for the others, which require more than 1 output value to completely define the resulting state. As [P3018] indicates, many of the existing extensions that support these functions output the extra state by passing a pointer to an output parameter as an argument to the function. I agree with the assessment of [P3018] that all output values should be returned because:

  1. constexpr: it is much more convenient to handle it this way.
  2. It should be a free-standing output and should not require a user to instantiate an object in order to have an address to receive the extra output value if they wish to discard it.

Return type

There are several ways to return multiple output values. For the cases where there is only 1 output value, just return the value as is with no additional structure. For the other cases these were the options considered.

Here is a table of the required output values for each function (where T is the underlying integer type):

Type of operation   | Standard  | Saturated | Safe overflow             | Wide arithmetic
Addition            | T         | T         | (T, bool)                 | (T, bool)
Subtraction         | T         | T         | (T, bool)                 | (T, bool)
Multiplication      | T         | T         | (T, bool)                 | (T, T)
Division, Remainder | T, (T, T) | T*        | (T, bool)*, (T, T, bool)* | (T, T)
Type casting        | T         | T         | bool                      | N/A

In my opinion it is best to stay away from types like std::pair, given their confusing ambiguity. Take for example "mul_wide" returning a std::pair assigned to a variable named val: what would be the meaning of "val.first" or "val.second"? Do you put the high bits in ".first" or ".second"? Unless the names of the member variables explicitly spell out their meaning (e.g. result_high and result_low) the whole thing is just hard to read; a dedicated data structure has better properties for this purpose.

If we use new dedicated data structures, the problem becomes what to name them. They are relatively specific to the function they are associated with, so one could for example use the pattern <function_name>_result (similar to what happens with std::from_chars), e.g. "mul_wide_result".
In addition, the types would need to be templated, since the function parameters, and consequently the types of the expected output, are also templated. This would require defining at least 8 new templated data types, in most cases supporting 10 different integer types, for a potential of 80 new data structures in practice. This isn't a big deal, compilers can certainly handle it quite easily, and the effort of implementing them is proportional to the effort of implementing the new features, but can we do better?
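A minimal sketch of what such a dedicated type could look like (the name and members follow the pattern suggested above; the exact shape is hypothetical):

```cpp
#include <cstdint>

// Hypothetical <function_name>_result type for mul_wide, following the
// pattern of std::from_chars_result. An implementation would be free
// to order the members as best suits its platform.
template <class T>
struct mul_wide_result
{
    T result_low;  // low bits of the full product
    T result_high; // high bits of the full product
};
```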

I lament the fact that C++ doesn't have better syntax and support for multiple return values. std::tuple has become the closest thing to it, and for this case it has some nice properties: there's no name confusion because there are no names, it doesn't require defining new types, and presenting the signature of the function is almost completely sufficient, needing only a side note specifying which value means what (and only in a couple of cases).
Using std::tuple also provides some quite satisfying syntax for splitting the return values, such as:

auto [value, remainder] = div_wide(...);
std::tie(value, remainder) = div_wide(...);

Performing std::get<0>(func(...)) would provide the same result as the trivial operators (+, -, *, /) do on most platforms, while std::get<1>(func(...)) would recover the information lost by the trivial operators (i.e. the overflow bit, carry/borrow bits, high bits in mul_wide, and the remainder on division).
While not perfect, it is quite suitable for the intended purpose, with no visible undesirable features except in one specific case: "mul_wide".

"mul_wide" has to output the high and low bits of the resulting multiplication. On platforms that don't have direct hardware support for a specific type but do have a wider integer type, an implementer may want to implement "mul_wide" by internally casting the inputs to the wider integer, doing a trivial multiplication, and then splitting the high and low bits. By defining that the low bits are returned in the first element of the tuple, on little-endian machines the memory layout happens to match the fact that the low bytes of an integer come first, allowing a compiler to optimize away any bit manipulation in all cases. But we need to acknowledge that big-endian machines also exist, and there the expected byte order is reversed; a compiler can still optimize away the bit manipulation depending on what happens on assignment, but it might not be possible to do so in all circumstances if the value needs to be passed to an unknown context (such as taking its address). With a new specialized data structure "mul_wide_result" that only requires "result_high" and "result_low" to be members, without specifying the order in which they appear, an implementer would be free to swap the values around to pick the best memory layout for their specific platform.
In my opinion, considering that big-endian devices are nowadays relatively rare, that the situations where a compiler couldn't optimize the whole thing away are rare as well, that most use cases are for the widest type available on the platform, where this trick couldn't be used anyway, and that only "mul_wide" has this problem, this doesn't justify using a new data type over a std::tuple.

With all things considered, I personally prefer the std::tuple approach.

Supported types and behaviour

For most cases the functions don't have weirdness in their behaviour, but there are a few cases worth mentioning and worth documenting.

To understand and solve those issues, there is one major assumption that I will use to justify them: signed integers are two's complement. This shouldn't be controversial, as it already holds in practice and C++20 mandates it. It simplifies the behaviour, as most operations on signed types can be done using their unsigned counterparts, and the differences can mostly be attributed to casting a narrower type to a wider one. While unsigned integers, when converted to a wider type, always pad the new high bits with 0's, signed integers pad the high bits with either all 1's or all 0's depending on the sign bit; this latter behaviour will be referred to as sign-padding.


mul_overflow

mul_overflow stands out as the only one with issues worth mentioning in unsigned arithmetic. When there's no overflow, the output value is the result of the multiplication and the overflow flag is false. When it overflows, the overflow flag should be true, but what should the value portion be? If one sees it as a scaled-down version of "mul_wide", it should be the low bits of the multiplication with the high bits discarded, i.e. the result with the most significant bits discarded. While the overflow flag is always useful, and the value is useful when it doesn't overflow, I personally don't know of any algorithm that could make use of a value that cannot be reconstructed. It might be more useful, especially on platforms that don't support this operation natively, to just leave the value portion undefined when it overflows, allowing a compatible version to be implemented more cheaply.
If the value were to be defined, I cannot think of anything it could be other than the low bits of "mul_wide". The question I leave here is: should the value be defined when the operation overflows?
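For reference, the low-bits-on-overflow reading can be sketched for 32-bit operands via a 64-bit widening multiply (hypothetical name and signature; this is one possible definition, not the settled one):

```cpp
#include <cstdint>
#include <tuple>

// Sketch of unsigned mul_overflow for 32-bit operands: {low bits of
// the full product, overflow flag}. Whether the value should instead
// be undefined on overflow is the open question discussed above.
constexpr std::tuple<std::uint32_t, bool>
mul_overflow(std::uint32_t a, std::uint32_t b) noexcept
{
    std::uint64_t p = std::uint64_t(a) * b;     // full 64-bit product
    return {std::uint32_t(p), p > 0xFFFFFFFFu}; // low bits, did high bits exist?
}
```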


add_carry

While the unsigned version of "add_carry" is straightforward, the signed version raises some questions:

  1. The input values can be positive or negative; should the input carry flag allow changing the value down as well as up?
  2. There are 2 types of "overflow": positive overflow and negative overflow. How does one tell the difference, and how does one interpret the value portion of the output?
Given the aforementioned assumption of two's complement, one can implement signed addition by simply adding the numbers as if they were unsigned; the only difference is that one wouldn't look at the unsigned carry flag to detect an overflow but rather at a "signed overflow" flag. A signed overflow can only occur if both input arguments have the same sign bit and the resulting value has a different sign bit.
Let's suppose we were trying to add 2 signed Big numbers composed of 4 octlets each. The theoretical principle by which addition works for numbers with larger bit counts is the same: add the numbers as usual, as if unsigned, and take the signed overflow flag instead, for which only the highest bit is relevant. To do this one would add all of the lower octlets using the unsigned version of "add_carry", except for the highest octlet, which would use the signed version of "add_carry". The signed version takes as input the same "carry" bit carried over from the previous unsigned operations and performs the operation in the exact same way, except the "signed overflow" flag is returned.
If the overflow flag is returned, we would know that 4 octlets are not enough to represent the value. Due to the two's complement way of representing numbers, the only extra thing we would need in order to get an exact value in Big number notation is to allocate an extra 5th octlet to take in the overflow and perform sign-padding; the sign value is equal to the sign bit of either of the input values, or the negation of the sign bit of the output value.
And this should answer our questions:
  1. The carry flag for signed add_carry should also be 1 bit, working exactly the same way as in the unsigned version.
  2. The returned "carry", or rather "signed overflow", would indicate that the signed version requires additional bits to represent the number. The value returned would be the lower bits of the expected number, and one would be able to recover the actual number by padding additional words depending on the sign bit.
One may still be uncomfortable with the fact that the value being returned is a "signed value", when in case of overflow the sign information has been carried away, and (value < 0) would not indicate that the resulting value is negative (but rather that quite the opposite is true).
Should the output value be unsigned, with the best we can do being to interpret it as a bunch of bits? No. The result should definitely be signed, given that if the operation does not overflow one can simply read the value as is, and if it does overflow the value is also not wrong. Although it may bother some that the sign bit would indicate the wrong sign of the actual resulting value, this is mostly a human problem specific to some. Nobody is bothered that when an unsigned addition overflows, (value ≤ UINT_MAX) holds while the true result is bigger than that; this is no different.
The sign bit isn't really a "sign" bit; one does not look at every other bit to figure out the absolute value of the integer and then decide the sign based on the "sign" bit. The whole number needs to be read holistically to be understood; it is not a representation where you can omit bits and still make sense of it. I myself prefer to interpret the nature of two's complement signed integers in light of modular arithmetic: all numbers can be interpreted in terms of their positive or negative modular counterparts, and the "sign" bit only indicates that we should prefer the negative form when interpreting the number.

The point of all this is: it has weird but well-defined behaviour. It should probably be written out in the standard to avoid confusion.
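The behaviour described above can be sketched for 32-bit operands (a minimal sketch assuming two's complement; the name and signature are this paper's proposal, not an existing API):

```cpp
#include <cstdint>
#include <tuple>

// Sketch of signed add_carry: add as unsigned, but report the *signed*
// overflow flag instead of the unsigned carry. Signed overflow occurs
// iff both inputs share a sign bit and the result's sign bit differs.
constexpr std::tuple<std::int32_t, bool>
add_carry(std::int32_t a, std::int32_t b, bool carry_in) noexcept
{
    std::uint32_t ua = std::uint32_t(a), ub = std::uint32_t(b);
    std::uint32_t ur = ua + ub + carry_in;
    // (ua ^ ur) and (ub ^ ur) both have the sign bit set exactly when
    // the result's sign differs from both inputs' signs.
    bool overflow = ((ua ^ ur) & (ub ^ ur)) >> 31;
    return {std::int32_t(ur), overflow}; // C++20: modular conversion
}
```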


sub_borrow

"sub_borrow" has a similar problem to "add_carry", and it is fixed in exactly the same way, except that sub_borrow can only overflow if the input sign bits are different, which is detected when the output sign bit differs from that of the left operand; everything else works the same.
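The mirrored detection rule can be sketched the same way (again a hypothetical signature under the two's-complement assumption):

```cpp
#include <cstdint>
#include <tuple>

// Sketch of signed sub_borrow: subtract as unsigned; signed overflow
// occurs iff the operands' sign bits differ and the result's sign bit
// differs from the left operand's.
constexpr std::tuple<std::int32_t, bool>
sub_borrow(std::int32_t a, std::int32_t b, bool borrow_in) noexcept
{
    std::uint32_t ua = std::uint32_t(a), ub = std::uint32_t(b);
    std::uint32_t ur = ua - ub - borrow_in;
    bool overflow = ((ua ^ ub) & (ua ^ ur)) >> 31;
    return {std::int32_t(ur), overflow};
}
```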

add_overflow, sub_overflow

These should be interpreted the same as add_carry/sub_borrow (respectively), except the input carry/borrow flag is fixed to 0.


mul_wide

Signed "mul_wide" also raises a question: should the low bits of the output be unsigned? After all, the low bits don't contain the "sign" information.
My answer here is: it doesn't matter! As explained in the addition case, sure, you can look at the most significant bit to get an indication of the sign of the number, but you should look at the number as a whole. It must be clear in the standard that for an N-width integer the interpretation of the number should be:

(high_bits << N) | low_bits

and not

high_value × 2^N + low_value

I personally prefer both output values to just be signed; it makes things simpler and more consistent.


div_wide

Signed "div_wide" has the complementary problem to "mul_wide", except the problem appears on the input parameters instead of the output values. The most important thing is that they are symmetrical: "div_wide" inputs should have the same signedness as the "mul_wide" outputs.


mul_add_wide

Before explaining the problem with signed mul_add_wide, let us consider how one would implement a signed multiply without overflow.
A trick that can be used, as before, is to just multiply the numbers together as if they were unsigned, and the math just works. If we applied the same trick while preserving the "high bits" in a hypothetical double-wide multiplication, the high bits would be wrong, but for non-overflowing multiplication one can just discard the high bits and it will be fine.
To implement a signed mul_wide one could first double the width of the input arguments, sign-pad the high bits, and then perform an unsigned multiplication. If for example we were trying to multiply 2 signed Big numbers composed of 2 octlets each, it could be implemented by performing an "as-if" conversion of the input arguments to 4 octlets each, then performing an unsigned multiplication by cross-multiplying all of the octlets and adding all the carries.
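The sign-extension trick just described can be sketched for 32-bit operands (portable sketch, hypothetical signature; the tuple puts the low bits first, matching the convention discussed for the return type):

```cpp
#include <cstdint>
#include <tuple>

// Sketch of signed mul_wide: sign-extend to 64 bits, multiply, split.
// Both halves are returned as signed; the low half must be read as raw
// bits, per the (high_bits << N) | low_bits interpretation above.
constexpr std::tuple<std::int32_t, std::int32_t>
mul_wide(std::int32_t a, std::int32_t b) noexcept
{
    std::int64_t p = std::int64_t(a) * std::int64_t(b); // sign-extended widening multiply
    return {std::int32_t(std::uint32_t(p)),  // low bits, reinterpreted (C++20: modular)
            std::int32_t(p >> 32)};          // high bits (C++20: arithmetic shift)
}
```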
The problem is that while unsigned "mul_add_wide" is useful to reduce the complexity of Big num multiplication, both signed and unsigned Big num multiplication can use the unsigned "mul_add_wide". I personally don't know of any algorithm that would use the signed version of this function, so I cannot say what it is supposed to look like; more specifically, should the additive parameter be signed or unsigned, i.e. should the additive parameter participate in sign-padding? There's the temptation of saying the additive parameter should have the same signedness as the multiplying parameters, as otherwise it would be the odd one out where signed and unsigned types are mixed. But while someone would surely want to use "mul_add_wide" in this way, actually mixing signed and unsigned numbers isn't unprecedented: signed add_carry does this, where the carry input can be seen as a 1-bit unsigned integer, as it always adds up and never down; sub_borrow is the opposite, always down and never up. Is there a better multiplication algorithm where a signed "mul_wide" would be preferable but the additive term shouldn't participate in sign-padding?
I don't have an answer to that. As far as I know, maybe yes, maybe no, maybe you need both.
Should "mul_add_wide" have only 1 templated parameter or 2?
If we decide that we only need 1 now, but later find out that it would be better to have 2, this will eventually break backwards compatibility. If we decide to have 2 but only need 1 in practice, it will end up as a useless quirk in the standard (which I would like to avoid). Perhaps mul_add_wide should wait for future work.


Naming

Names like add, sub, mul, div, and rem have long-standing unwritten conventions as to what they mean; their usage is unambiguous, there is precedent for them, and they are all exactly 3 characters long, which makes them look really nice when aligned together. So I propose to keep that. The rest of the text in the function names is mostly plain English and well known in meaning, so I advocate keeping the names just as I have written them in the table.

You will either like them or you won't, it is mostly just preference.

Feature test macro

I propose the usage of __cpp_lib_overflow_arithmetic as a test feature for all families and remove __cpp_lib_saturation_arithmetic.

Action plan

I propose the following for future work:

  1. Amend the feature test macro from __cpp_lib_saturation_arithmetic to __cpp_lib_overflow_arithmetic.
  2. Discuss whether or not div_sat should be eliminated, and whether div_overflow and div_rem_overflow should be introduced with the rest.
  3. Get an opinion from library implementers and prospective users on whether or not the value portion of "mul_overflow" should be undefined on overflow.
  4. Get an opinion from library implementers on whether signed "mul_add_wide" should have a signed, unsigned, or both signedness types for the additive parameter.
  5. Gauge consensus on the naming and return types of the proposed functions.
  6. Formalize a standard proposal taking into account the formed consensus.


Thanks to Melissa for the observations regarding the problematic usage of cryptography as motivation and on some spelling issues.

Thanks to Jens Maurer and Giuseppe D'Angelo for corrections regarding the standard interpretation of the accepted data types in the saturated families.


  1. [P0543] ISO/IEC JTC1 SC22 WG21 P0543R3: Saturation arithmetic, Jens Maurer
  2. [P3018] ISO/IEC JTC1 SC22 WG21 P3018R0: Low-Level Integer Arithmetic, Andreas Weis