
Symmetry and Data Augmentation#626

Open
Scienfitz wants to merge 57 commits into main from feature/invariance_augmentation

Conversation

Collaborator

@Scienfitz Scienfitz commented Aug 21, 2025

Implements #621

New: A Symmetry class that is part of BayBE surrogates. Three distinct symmetries are included; for more info, check the user guide, and for a demonstration of the effect, see the new example. The ability to perform data augmentation has been included for all symmetries.

I have left some initial comments on design questions that are still open or where I am somewhat indifferent and simply had to choose one option. Feel free to leave an opinion there first so the large-scale design picture can be finalized independently of small comments.

TODO

  • CHANGELOG (after the architecture is locked in)
  • Remember to compress the finalized SVG picture

Other Notes
Symmetries and constraints are conceptually so similar that they should probably share the same interface. The design here was done from scratch, completely ignoring the constraint interface, because that interface is already known to be suboptimal and needs refactoring.

  • There is no overarching shared attribute parameters or similar because some symmetries allow a single such parameter and some allow multiple. Instead, the parameters are treated analogously to how objectives treat their target(s).
  • Contrary to the dependency constraint, a dependency symmetry can only hold one set of dependencies. The constraint should be refactored to look the same.
  • There is one exception I made, but I'm open to still changing it.

@Scienfitz Scienfitz self-assigned this Aug 21, 2025
@Scienfitz Scienfitz added the new feature New functionality label Aug 21, 2025
@Scienfitz Scienfitz linked an issue Aug 21, 2025 that may be closed by this pull request
Collaborator

AVHopp commented Aug 25, 2025

@Scienfitz just to make sure - I guess since this is marked as a draft, you do not require a PR review for now, right? Is there anything else that we can assist with?

@Scienfitz
Collaborator Author

@AVHopp Yes, exactly, and it will always be like that for PRs that I open in draft: ignore until a review is requested or asked for in some other way.

@Scienfitz Scienfitz force-pushed the feature/invariance_augmentation branch from 00cfef8 to 5ba4cb1 Compare September 10, 2025 08:10
@Scienfitz Scienfitz force-pushed the feature/invariance_augmentation branch 3 times, most recently from bc671eb to db98f64 Compare September 25, 2025 10:32
@Scienfitz Scienfitz force-pushed the feature/invariance_augmentation branch from db98f64 to aedafa7 Compare September 25, 2025 10:41
@Scienfitz Scienfitz marked this pull request as ready for review September 25, 2025 11:01
Copilot AI review requested due to automatic review settings September 25, 2025 11:01
Collaborator Author

So far this is a large monolith of a file. A large part of it is validation, so I can see how it could be split up into

  • symmetries
    • symmetries.base
    • symmetries.core
    • symmetries.validation

However, this would require outsourcing the content of validate_searchspace_context, so the methods would effectively become empty hulls calling a utility imported from symmetries.validation. I also like the closeness of the Raises section and the functional content to the actual class. So I'm not sure if I should outsource the validations.

Collaborator

Agreed, I would also prefer the current structure over the split that you described above. Food for thought: what about splitting as follows:

  • symmetries
    • symmetries.core
    • symmetries.permutation
    • symmetries.mirror
    • symmetries.dependency


# Object variables
# TODO: Needs inner converter to tuple
copermuted_groups: tuple[tuple[str, ...], ...] = field(
Collaborator Author

This is the only exception where I still somewhat followed the design of the corresponding constraint. Because all constraints share parameters, this symmetry requires specifying copermuted_groups as a separate attribute. But actually, all the groups could be combined into a single attribute.

Instead of parameter_names=["a1", "a2"] and copermuted_groups=[["b1", "b2"], ["c1", "c2"]] we would have permutation_groups = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]. This is certainly more logical, but maybe a bit less user-friendly if there is really just one group.

As a consequence, we would need to (re)define what the parameters property should even contain, or whether it should be dropped entirely.

In the absence of opinions, this will stay as it is.

Collaborator

I have to say that I find the concept of the copermuted_groups very weird. Maybe I am misunderstanding, but isn't this the same as having multiple permutations at once? But I will ask you this in person, since I guess it will be significantly easier to explain verbally.

Collaborator Author
@Scienfitz Scienfitz Nov 25, 2025

It's not the same, because the copermuted groups are not permuted independently of the primary group, but in exactly the same way as the primary group. Imagine it like this:

  • take an iterable (the primary group)
  • get all permutations of the indices of that iterable
  • create all permutations of the original iterable by applying all permuted indices
  • now also apply the permuted indices to the other iterables (the copermuted groups)

As I wrote in my comment, I would be open to getting rid of the primary and copermuted groups by combining them into one single attribute, but that would result in a very different interface for Symmetries (where we would even have to debate whether Symmetries, and by extension also constraints, can have a property like .parameter_names and what it would mean).
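The steps above can be sketched as follows (all names are invented for illustration; this is not BayBE code):

```python
from itertools import permutations

primary = ("p1", "p2", "p3")     # the primary group
copermuted = ("a1", "a2", "a3")  # a co-permuted group

augmented = []
for idx in permutations(range(len(primary))):
    # The SAME index permutation is applied to the primary group and to
    # every co-permuted group -- they are never permuted independently.
    augmented.append(
        (
            tuple(primary[i] for i in idx),
            tuple(copermuted[i] for i in idx),
        )
    )

# 3! = 6 joint permutations, not 6 * 6 independent ones
print(len(augmented))
```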

Collaborator

I now get the concept of co-permuted groups. I am not sure whether we would want to combine primary and co-permuted groups. My gut feeling is that co-permuted groups will often not really be necessary for the user, so sacrificing a nicer and cleaner interface for them does not feel right to me.
Do maybe @AdrianSosic and @kalama-ai have opinions here?

# Validate compatibility of surrogate symmetries with searchspace
if hasattr(self._surrogate_model, "symmetries"):
    for s in self._surrogate_model.symmetries:
        s.validate_searchspace_context(searchspace)
Collaborator Author

Important: Validation is so far only part of the recommend call here in the recommenders. Validation has not been included in the Campaign yet. This is due to two factors:

  • To properly validate the compatibility of symmetries and searchspace, there needs to be a mechanism that can iterate over all possible recommenders of a meta recommender. Otherwise, this upfront validation already fails for the two-phase recommender if the second recommender has symmetries.
  • There would be double validation between the campaign and the recommend call, so the context info of whether validation was already performed needs to be passed somewhere. This is likely fixable with a settings mechanism that is not yet available.

@Scienfitz Scienfitz marked this pull request as ready for review November 3, 2025 17:31
@Scienfitz Scienfitz requested a review from Copilot November 4, 2025 08:59
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@Scienfitz Scienfitz requested a review from Copilot November 4, 2025 11:55
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 27 out of 29 changed files in this pull request and generated 3 comments.



Scienfitz and others added 2 commits November 4, 2025 12:58
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

This comment was marked as outdated.

Collaborator

@AdrianSosic AdrianSosic left a comment

Hi @Scienfitz, here is my first batch of comments. I still need to finalize the dependency symmetry and the docs, but my GitHub UI is getting so slow that I feel I need to submit the pending comments 😬

eval_during_modeling: ClassVar[bool]
"""Class variable encoding whether the condition is evaluated during modeling."""

eval_during_augmentation: ClassVar[bool] = False
Collaborator

Do we need a REMOVED entry in the changelog?

Collaborator Author

IMO no, as it was not really used.


return PermutationSymmetry(
parameter_names=self.parameters,
copermuted_groups=(tuple(self.dependencies.parameters),) # type: ignore[arg-type]
Collaborator

  • Why is there an explicit tuple conversion here?
  • Why do we need the type silencing?

--> Neither should be required when the field converters are properly set up

Collaborator Author

I mean, the attribute expects a tuple, so I am providing a tuple. Are you criticizing this?

The type silencing here is something I actively tried to fix, but that seems to be very hard. I don't remember the details now, though; I would have to check again.

return tuple(x)


def normalize_str_sequence(
Collaborator

The newly introduced name is misleading and the introduced type hint is wrong: you introduced this function when we enabled Booleans for categorical parameters. Both the name and the type hint suggest that the incoming object is already a sequence of strings, which is not the case – it is more generally a sequence of "convertible to string", and right now we use it for str/bool.

Comment on lines +127 to +144
def augment_measurements(
    self, data: pd.DataFrame, parameters: Iterable[Parameter]
) -> pd.DataFrame:
    """Apply data augmentation to measurements.

    Args:
        data: A dataframe with measurements.
        parameters: Parameter objects carrying additional information (might not be
            needed by all augmentation implementations).

    Returns:
        A dataframe with the augmented measurements, including the original ones.
    """
    for s in self.symmetries:
        data = s.augment_measurements(data, parameters)

    return data

Collaborator

I think we should call the dataframe measurements to be consistent with the remaining code, right?

Suggested change

def augment_measurements(
    self, data: pd.DataFrame, parameters: Iterable[Parameter]
) -> pd.DataFrame:
    """Apply data augmentation to measurements.

    Args:
        data: A dataframe with measurements.
        parameters: Parameter objects carrying additional information (might not be
            needed by all augmentation implementations).

    Returns:
        A dataframe with the augmented measurements, including the original ones.
    """
    for s in self.symmetries:
        data = s.augment_measurements(data, parameters)
    return data

def augment_measurements(
    self, measurements: pd.DataFrame, parameters: Iterable[Parameter]
) -> pd.DataFrame:
    """Apply data augmentation to measurements.

    Args:
        measurements: A dataframe with measurements.
        parameters: Corresponding parameter objects carrying additional information
            (not needed by all augmentation types).

    Returns:
        A dataframe with the augmented measurements, including the original ones.
    """
    for s in self.symmetries:
        measurements = s.augment_measurements(measurements, parameters)
    return measurements



def validate_is_finite(  # noqa: DOC101, DOC103
    _: Any, attribute: Any, value: float | Sequence[float]
Collaborator

Let's put the proper Attribute type hint here


# object variables
mirror_point: float = field(
    default=0.0, converter=float, validator=validate_is_finite, kw_only=True
Collaborator

Any reason why it's kw_only? I don't think it's strictly necessary here...

Suggested change
default=0.0, converter=float, validator=validate_is_finite, kw_only=True
default=0.0, converter=float, validator=validate_is_finite

Collaborator

Also, I'm not 100% sure it is reasonable to assume a default here. What kind of data would be mirrored around exactly 0? Seems to me like an edge case, resp. a default that makes sense in theory but not really in practice.

Collaborator Author

Huh? I think 0 is an obvious default, because it will be the chosen value for all cases like "only the magnitude matters", which is a mirror symmetry around 0. I actually think it will be hard to find a real use case where it is not 0.
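For illustration, a minimal sketch of what mirror augmentation around mirror_point = 0 could look like (the function and column names are hypothetical, not BayBE's API):

```python
import pandas as pd

def mirror_augment(
    measurements: pd.DataFrame, column: str, mirror_point: float = 0.0
) -> pd.DataFrame:
    """Append copies of the rows with `column` reflected about `mirror_point`."""
    mirrored = measurements.copy()
    mirrored[column] = 2 * mirror_point - mirrored[column]
    return pd.concat([measurements, mirrored], ignore_index=True)

df = pd.DataFrame({"x": [-1.0, 2.0], "y": [0.5, 0.7]})
augmented = mirror_augment(df, "x")  # x values become -1, 2, 1, -2; y unchanged
```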

Collaborator

True, I misread that. I will leave this thread open, as there is still the question regarding kw_only, on which I do not have a strong opinion.


@override
def augment_measurements(
    self, df: pd.DataFrame, _: Iterable[Parameter]
Collaborator

df --> measurements

A dependency symmetry expresses that certain parameters are dependent on another
parameter having a specific value. For instance, the situation "The value of
parameter y only matters if parameter x has the value 'on'.". In this scenario x
is the causing parameter and y depends on x.
Collaborator

Comment regarding terminology, here and in other docstrings, since the name is "DependencySymmetry" and we talk about the "dependent parameters":

  • Wouldn't it make sense to call the "causing" parameter the "independent parameter" instead? That would be more consistent and would also solve the problem that "causing" sort of implies a causal relationship from x to y (in the sense that x physically affects y), which is not the case.
  • You call them "dependent" parameters, which is fine with me, but why do you then call the corresponding attribute "affected"?

Collaborator

@AVHopp AVHopp left a comment

First set of comments, only on the user guide and example, without having looked at the code. I like them both quite a lot; the comments are mostly minor.

Below we illustrate the effect of data augmentation for the different symmetries
supported by BayBE:

![Symmetry and Data Augmentation](../_static/symmetries/augmentation.svg)
Collaborator

For the left image of the DependencySymmetry, I think it would make more sense to add some ticks to the y axis and mark the y values in that way.

Comment on lines +48 to +51
:class: warning
Invariant kernels will be applied automatically when a corresponding symmetry has been
configured for the surrogate model GP. This feature is not implemented yet.
```
Collaborator

This is confusing: for me, the first sentence reads like "this is already implemented", so I would reformulate it.

Collaborator

Since I am not (yet) fully done with the PR, maybe a suitable formulation would be "Ideally, invariant kernels will be applied automatically..."?

# (a common way of also resulting in permutation invariance) and have multiple minima.
# In practice, permutation invariance can arise e.g. for
# [mixtures when modeled with a slot-based approach](/examples/Mixtures/slot_based).
# BayBE supports other kinds of symmetries as well (not part of this example).
Collaborator

I would simply delete this sentence or replace it with a link to the user guide – why mention this here?

Comment on lines +123 to +124
# We set up a constrained and unconstrained searchspace to demonstrate the impact of
# the constraint on optimization performance.
Collaborator

Suggested change
# We set up a constrained and unconstrained searchspace to demonstrate the impact of
# the constraint on optimization performance.
# We set up a constrained and an unconstrained searchspace to demonstrate the impact of
# the constraint on optimization performance.

Collaborator

Maybe even mention the word "plain" here so that the description is closer to the code?

print(f"{'Without Constraint:':<35} {len(searchspace_plain.discrete.exp_rep)}")
print(f"{'With Constraint:':<35} {len(searchspace_constrained.discrete.exp_rep)}")

# We can see that the searchspace without the constraint has more points than the other
Collaborator

Naively, I would have expected a factor of 2 as the difference. It is clear to me that this factor is not met exactly due to the diagonal, but I'm asking myself whether we should add a comment like: "We can see that the searchspace without the constraint has roughly double the number of points compared to the constrained one. This is exactly what we expect, as we only model one half of the parameter triangle. Note that the factor is not exactly 2 due to the points on the diagonal."

# subplot.

# BayBE can automatically perform this augmentation if configured to do so.
# Specifically, surrogate models have the `Surrogate.symmetries` attribute. If any of
Collaborator

Any reason why Surrogate.symmetries is not a link?

Collaborator

@AVHopp AVHopp left a comment

Incomplete, but I cannot finish today.

Args:
    df: The dataframe containing the measurements to be augmented.
    parameters: Parameter objects carrying additional information (might not be
        needed by all augmentation implementations).
Collaborator

If this is not needed by all types/implementations, why is it then not an optional argument?

Collaborator Author

Right, I should give it None as a default.

copermuted_groups: tuple[tuple[str, ...], ...] = field(
    factory=tuple, converter=tuple
)
"""Groups of parameter names that are co-permuted like the other parameters."""
Collaborator

Also, "co-permuted" vs. "copermuted" is a bit inconsistent (but I think this is the only place where we have the -

Comment on lines +175 to +181
# The input could look like:
# - params = ["p_1", "p_2", ...]
# - copermuted_groups = [["a_1", "a_2", ...], ["b_1", "b_2", ...]]
# indicating that the groups "a_k" and "b_k" should be permuted in the same way
# as the group "p_k".
# We create `groups` to look like (("p1", "a1", "b1"), ("p2", "a2", "b2"), ...).
# It results in just (("p1",), ("p2",), ...) if there are no copermuted groups.
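The transposition described in the comment above can be sketched as follows (the names are invented for illustration):

```python
params = ("p1", "p2")
copermuted_groups = (("a1", "a2"), ("b1", "b2"))

# Transpose the primary group together with all co-permuted groups so that
# each entry holds the k-th member of every group.
groups = tuple(zip(params, *copermuted_groups))
print(groups)  # (('p1', 'a1', 'b1'), ('p2', 'a2', 'b2'))

# With no co-permuted groups, the same expression yields (('p1',), ('p2',)).
print(tuple(zip(params)))
```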
Collaborator

I think that this is something that belongs in the docstring of the method here: it makes clear to the user what is actually happening and also demonstrates what the copermuted groups are.

Collaborator Author

Possible for the first part, but I disagree for the second part (the text after "We create..."), as that merely describes how the input for the utility is crafted, which is meant for the dev, not the user.

Collaborator

OK, then please adjust the docstring and feel free to simply resolve here afterwards.



Collaborator

@AVHopp AVHopp left a comment

The review is still WIP, but since I won't be able to add more comments soon, here is the first batch.


ref_vals = set(cast(DiscreteParameter, params[0]).values)
if any(
    set(cast(DiscreteParameter, p).values) != ref_vals for p in params
Collaborator

Couldn't you start at params[1] here, since the first comparison is between params[0] and ref_vals, which are the same?
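For illustration, a minimal sketch of that check using a stand-in class (not BayBE's DiscreteParameter):

```python
class FakeParam:
    """Stand-in for a discrete parameter with a `values` attribute."""

    def __init__(self, values):
        self.values = values

params = [FakeParam([1, 2]), FakeParam([2, 1]), FakeParam([1, 3])]

ref_vals = set(params[0].values)
# Starting at params[1] skips the redundant self-comparison of params[0].
mismatch = any(set(p.values) != ref_vals for p in params[1:])
print(mismatch)  # True: the third parameter has values {1, 3}
```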

# are not considered here since technically this restriction is not
# required for them, as all numbers can be added if the tolerance is
# configured accordingly.
if all(p.is_discrete and not p.is_numerical for p in params):
Collaborator

At least from what I see in this file, it seems like Symmetry can be used for all kinds of search spaces. Wouldn't we then also need to check for ContinuousParameter here whether the bounds are the same, since there is no tolerance?



def augment_measurements(
    self, df: pd.DataFrame, parameters: Iterable[Parameter]
) -> pd.DataFrame:
    # See base class.
Collaborator

This needs a more detailed docstring, as it is not obvious what kind of information parameters should carry here.

Collaborator Author

AFAIK the base class explains that this info is not used by all derived functions; is that not enough?

Collaborator

Well, but this class now uses the parameters, so in my opinion there should be an explanation at the level of this docstring that explains how they are used in this child class.

def validate_is_finite(  # noqa: DOC101, DOC103
    _: Any, attribute: Any, value: float | Sequence[float]
) -> None:
    """Validate that ``value`` contains no infinity/nan.
Collaborator

Suggested change
"""Validate that ``value`` contains no infinity/nan.
"""Validate that ``value`` contains no resp. is not infinity/nan.

Collaborator

Same for the Raises section.

Scienfitz and others added 12 commits December 12, 2025 16:28
Co-authored-by: AdrianSosic <adrian.sosic@merckgroup.com>
Co-authored-by: AdrianSosic <adrian.sosic@merckgroup.com>
Co-authored-by: AdrianSosic <adrian.sosic@merckgroup.com>
Co-authored-by: AdrianSosic <adrian.sosic@merckgroup.com>
Co-authored-by: Alexander V. Hopp <alexander.hopp@merckgroup.com>
Co-authored-by: AdrianSosic <adrian.sosic@merckgroup.com>
Co-authored-by: AdrianSosic <adrian.sosic@merckgroup.com>
Co-authored-by: Alexander V. Hopp <alexander.hopp@merckgroup.com>
Co-authored-by: AdrianSosic <adrian.sosic@merckgroup.com>
Co-authored-by: Alexander V. Hopp <alexander.hopp@merckgroup.com>
Co-authored-by: Alexander V. Hopp <alexander.hopp@merckgroup.com>
Co-authored-by: AdrianSosic <adrian.sosic@merckgroup.com>
UBOUND = 2.0


def lookup(df: pd.DataFrame, a=1.0, b=1.0, c=1.0, d=1.0, phi=0.5) -> pd.DataFrame:
Collaborator

Why have these parameters as arguments? The lookup is never used in any other place, and 4/5 parameters are simply equal to 1.

@Scienfitz Scienfitz added this to the 0.15.0 milestone Jan 7, 2026

Labels

new feature New functionality


Development

Successfully merging this pull request may close these issues.

Add Data Augmentation for Invariant Contraints

5 participants