Metadata FBC3 #1440

Open
wants to merge 165 commits into
base: devel

pascalaldo

Updated version of the SBML FBC version 3 implementation #1237 by @akaviaLab and earlier #988 by @Hemant27031999.

Implements:

  • SBML CVterms as StandardizedAnnotations. These were accessible before in a simplified version in cobrapy, but now all their information can be accessed.
  • SBML FBC3 KeyValuePairs as CustomAnnotations. Key-value pairs (optionally with URI).
  • History + creators (this hasn't changed much since the previous pull request).
  • A Metadata class that enables access to all this information, accessible through an Object's metadata attribute.
  • A SimplifiedAnnotationInterface, accessible through an Object's annotation attribute. This interface behaves like a dictionary for backward compatibility and allows access to an Object's StandardizedAnnotations.
  • Support for floating point (double) charge values (part of FBC3).

For reference, the FBC3 specification can be found here: https://github.com/sbmlteam/sbml-specifications/blob/develop/sbml-level-3/version-1/fbc/spec/sbml-fbc-version-3-release-1.pdf

Major differences with the two previous pull requests:

  • Latest changes in the cobrapy devel branch have been merged here.
  • Based on comments on the previous pull requests, CVTerm has been renamed to StandardizedAnnotation and KeyValuePair to CustomAnnotation.
  • Standardized and custom annotations are stored in StandardizedAnnotationStore and CustomAnnotationStore, respectively. The annotations keep track of their parent and can be removed using annotation.remove_from_parent(), similar to how metabolites etc. do this. This is also implemented for Resources with respect to their parent StandardizedAnnotation. There is also a StandardizedAnnotationList, which is basically the same class as StandardizedAnnotationStore, but it does not change the ownership/parent-child relationship. This class can therefore be used as a return type for a selection of existing annotations (a bit like a view).
  • StandardizedAnnotation has been simplified a bit. Previously, it would contain a qualifier and a list of ExternalResources, which could contain resources and nested annotations. The external resources class has now been removed and StandardizedAnnotations directly contain resources and nested annotations. A Resource is now its own class that handles interpreting URIs.
  • CustomAnnotations (key-value pairs) do not prominently feature a name and id anymore, since I think these make them more confusing. They can still be set, but they are not part of the __init__ method etc. A CustomResource now also inherits from the Object class, so it can have its own metadata. The FBC3 specification is a bit vague here: it does not explicitly state that KeyValuePair inherits from SBase, but it can have a name and id, fbc-21501 mentions that it can have an sboTerm, and fbc-21502 mentions that it can have notes and annotations/metadata, which is basically what is implemented in the cobrapy Object class.
  • Qualifier names are now also more readable (e.g. Biological_is vs. bqb_is, Modelling_isDerivedFrom vs bqm_isDerivedFrom).
  • The Metadata class is available as an Object's metadata attribute, instead of annotation. The annotation attribute is now a class instance that is basically a dict-like view of the standardized annotations in metadata.standardized (SBML CVTerms).
  • Creator names are saved as a single name value (as supported by SBML L3 v2 Core), not separate first and last names, which are not applicable for all cultures (also mentioned in the discussion around previous pull requests). In addition, this simplifies handling names. For older files that use separate given + family names, these names are simply concatenated.
  • As mentioned in comments on the previous pull requests, the standardized annotations can now easily be exported in a flattened format, e.g. using the to_records method, of which the output can directly be used in pandas (demonstrated in the metadata jupyter notebook).
  • Added a function to determine default qualifiers, for when not explicitly setting a qualifier (e.g. using the old annotation interface), this can probably be expanded/improved.
  • Updated the json v2 schema.
  • Output now makes use of SBML L3 V2 features and FBC3 features.
  • Added more tests and documentation, code has high test coverage.
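
The difference between an owning store and a non-owning list, described in the list above, can be illustrated with a small toy model. This is only a sketch under assumptions, not the code in this pull request: ToyAnnotation, ToyAnnotationList, ToyAnnotationStore and the select method are hypothetical stand-ins, and the qualifier and URI are illustrative values.

```python
class ToyAnnotation:
    """Hypothetical stand-in for a StandardizedAnnotation."""

    def __init__(self, qualifier, resources=()):
        self.qualifier = qualifier
        self.resources = list(resources)
        self._parent = None  # the store that owns this annotation, if any

    def remove_from_parent(self):
        # Detach from the owning store; does nothing if there is no owner.
        if self._parent is not None:
            self._parent.remove(self)


class ToyAnnotationList:
    """Non-owning collection: holds annotations without touching _parent."""

    def __init__(self, items=()):
        self._items = []
        for item in items:
            self.add(item)

    def add(self, annotation):
        self._items.append(annotation)

    def remove(self, annotation):
        self._items.remove(annotation)

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


class ToyAnnotationStore(ToyAnnotationList):
    """Owning collection: sets and clears the parent link of its members."""

    def add(self, annotation):
        annotation.remove_from_parent()  # at most one owner at a time
        super().add(annotation)
        annotation._parent = self

    def remove(self, annotation):
        super().remove(annotation)
        annotation._parent = None

    def select(self, qualifier):
        # Return a non-owning list so that selecting does not steal ownership.
        return ToyAnnotationList(a for a in self if a.qualifier == qualifier)


store = ToyAnnotationStore()
ann = ToyAnnotation("bqb_is", ["https://identifiers.org/chebi/CHEBI:17234"])
store.add(ann)
view = store.select("bqb_is")  # ann is still owned by `store`
ann.remove_from_parent()       # removes ann from `store`, not from `view`
```

In this toy version the selection keeps a reference to the annotation after it has been removed from its store, which is the "view without ownership" behaviour the StandardizedAnnotationList is described as providing.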

Outstanding issues:

  • All tests are passing, except one. This test is for saving + reading the annotations of a key-value pair (CustomAnnotation). As mentioned above, it is explicitly stated that KeyValuePair, like an SBase instance, can have annotations. Reading files that have this does work, but saving does not, since this was not implemented in libsbml. I will open an issue in the libsbml repo addressing this.
  • Cobrapy will now output SBML L3 v2 Core + FBC3. However, the FBC3 extension spec was written for SBML L3 v1. I'm not sure whether this is an issue.
  • I have not looked at other features in FBC3, besides the annotations and floating point charge. We could probably add a test file + some test that go over all the new features.
  • Navigating standardized annotations can be improved, I think. The combination of nested annotations, allowing multiple entries with the same qualifier and multiple resources with the same namespace makes selecting a single resource quite a hassle. I tried to implement some convenience methods, but there is still some room for improvement.
  • I should add some examples to the docstrings. For now, I mainly updated other parts of the docstring and added examples to the jupyter notebook, but some examples should also be given in the docstrings of methods.
  • Some of the parameter and return types in the docstrings are quite messy, since I've made some methods quite flexible (e.g. allow a single object or a list and the object can be of any type that can be interpreted as an instance of this class).
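
To make the navigation problem above concrete, here is a minimal sketch of flattening nested annotations into one record per resource. It is not the to_records implementation from this pull request: flatten_annotations, the dict layout with "qualifier", "resources" and "nested" keys, and the example URIs are assumptions made for illustration only.

```python
def flatten_annotations(annotations, parent_path=()):
    """Return one flat record per resource, walking nested annotations."""
    records = []
    for index, ann in enumerate(annotations):
        path = parent_path + (index,)
        for uri in ann.get("resources", []):
            records.append(
                {
                    "path": ".".join(map(str, path)),  # position in the tree
                    "depth": len(path) - 1,            # 0 for top level
                    "qualifier": ann["qualifier"],
                    "resource": uri,
                }
            )
        # Recurse into nested annotations, extending the path.
        records.extend(flatten_annotations(ann.get("nested", []), path))
    return records


# Illustrative data: two top-level entries share a qualifier, one is nested.
annotations = [
    {
        "qualifier": "bqb_is",
        "resources": ["https://identifiers.org/chebi/CHEBI:17234"],
        "nested": [
            {
                "qualifier": "bqb_isDescribedBy",
                "resources": ["https://identifiers.org/pubmed/12345"],
            }
        ],
    },
    {
        "qualifier": "bqb_is",
        "resources": ["https://identifiers.org/kegg.compound/C00031"],
    },
]
records = flatten_annotations(annotations)
```

A flat list of dicts like this can be passed directly to pandas.DataFrame, which turns selection into ordinary filtering on the qualifier or resource columns.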

Smaller issues:

  • We could probably remove some of the methods of some of the new classes, since there is some overlap in what they do.
  • Comparing resources is not really robust, since different variants of an identifier URI could point to the same resource, but they are not considered identical (e.g. http:// vs https://). We could probably improve the logic there.
  • We're not checking whether resources actually exist (not sure whether this is problematic).
  • There could be more guidance for qualifiers, e.g. which resource namespace + qualifier can never be used.
  • We could add warnings when using the old annotation interface, since it can mess up the standardized annotations.
  • Not really related to this specific pull request, but I had to recreate the pickle files for testing. I'm not a big fan of having pickle files in the repo, since they pose quite a security risk.
  • Regarding the creators of an object: right now, a new Creator instance is created for each object with creator info. My thought is: would it be better to keep a list of unique creators and reference the same instance for each of its occurrences in a model?
  • The History + creators part is quite minimal now, which is fine, but we could probably allow a user to set their default credentials and add a 'track changes' option or something, that automatically adds the user as a creator to new objects and maybe even adds the user to the list of creators when you make a change to an object.
  • There is some metadata related to objects that are not explicitly imported into cobrapy (e.g. some ListOf... tags), which we could store in a dict as part of the model, so that the metadata is available somewhere.
  • It would be a good idea to consider how cobrapy could store additional data as CustomAnnotation. An example could be the value of functional for a gene, as proposed in issue [BUG] Gene.functional attribute not saved in SBML model #1422 . One could think of storing its value under the key 'cobra_gene_functional' for example. Questions would be: should these values still be available under metadata.custom? And what happens when that value changes? Should it have a way of storing the datatype (e.g. 'bool:true') and should that be implemented as part of the CustomAnnotation class? This is all not really urgent, but if some changes are needed to the custom annotation interface, we might want to make them now.
  • Some class, method and attribute names could be changed. I'm not a huge fan of the Resource class name, since it's a bit vague, but I wasn't sure what else to name it. In previous discussions it was also mentioned that Author may be a better name than Creator. The StandardizedAnnotation and CustomAnnotation class names are quite long, so we could try to rename those too. Maybe Annotation instead of StandardizedAnnotation and Property/Symbol/Attribute/Parameter/Flag/... instead of CustomAnnotation.
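
As a rough illustration of the datatype question raised in the list above (the 'bool:true' idea), the following sketch encodes a Python value as a type-prefixed string and decodes it again. Nothing like this exists in the pull request; encode_value and decode_value are hypothetical helpers, and the prefix scheme is just one possible convention.

```python
def encode_value(value):
    """Encode a Python value as a '<type>:<text>' string."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool:" + ("true" if value else "false")
    if isinstance(value, int):
        return f"int:{value}"
    if isinstance(value, float):
        return f"float:{value!r}"
    return f"str:{value}"


def decode_value(text):
    """Decode a '<type>:<text>' string; unknown prefixes are returned as-is."""
    prefix, sep, raw = text.partition(":")
    if not sep:
        return text  # no prefix at all: treat as a plain string
    if prefix == "bool":
        return raw == "true"
    if prefix == "int":
        return int(raw)
    if prefix == "float":
        return float(raw)
    if prefix == "str":
        return raw
    return text  # e.g. an untyped value that happens to contain a colon


# Hypothetical use for the Gene.functional example from the text.
custom = {"cobra_gene_functional": encode_value(True)}
functional = decode_value(custom["cobra_gene_functional"])
```

Untyped values that contain a colon (for example identifiers such as CHEBI:17234) pass through unchanged in this sketch, but a value that starts with one of the reserved prefixes would be misread, which is one argument for storing the datatype separately from the value.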

Fixes issues:
#810: The sbml_info, which holds basic SBML document information (packages, level, version, notes, annotations attached to the SBML component, etc.), is now written to JSON.
#684: The complete metadata structure has been redesigned. A compatibility interface remains in the annotation attribute of each object, whilst all structured metadata can be accessed through the metadata attribute. I've split this up, since changing values through the old interface may change the structure/hierarchy or qualifiers. By having a separate annotation and metadata attribute, it is ensured that changing anything in metadata does not have a destructive effect. Additionally, old code and old file formats (e.g. json schema v1) can just write to and read from the annotation interface as if nothing has changed, while new code and formats (e.g. json schema v2) use the metadata attribute.
#937: As mentioned above, using the metadata interface should not lead to loss of any relation information.

Let me know what you think of the changes. In the meantime, I will try to fix some of the outstanding issues.

…notationList to serve as alternative to StandardizedAnnotationStore, but without ownership.
…fiersAlias enum to handle common sets of qualifiers.
…vides a powerful interface to select resources.
…t cases, improved implementation. Almost all tests are still passing, except some that seem to be caused by lacking implementations in libsbml.
…c-20304). Only remaining failing test is missing implementation of saving nested annotations of KeyValuePairs (fbc-21502).
@Midnighter
Member

Fantastic, thank you for picking up the work. I'll need to reacquaint myself with the changes as it has been too long ago, but I will try to review soon so we can finally publish this work.

@pascalaldo
Author

Thanks!
I see in the CI pipeline that python 3.8 has 1 more set of tests failing, so I'll fix those.

…RLs and compact urls with integrated namespace) should all work correctly now. Changed Resource equality to namespace+identifier equality, instead of URI equality. Added tests for these cases.
…identifiers.org identifiers as URI. Added the appropriate tests for that. Made hashing method consistent with equality method.
@pascalaldo
Author

pascalaldo commented May 1, 2025

For reference, I fixed/started fixing the following:

All tests are passing, except one. This test is for saving + reading the annotations of a key-value pair (CustomAnnotation). As mentioned above, it is explicitly stated that KeyValuePair, like an SBase instance, can have annotations. Reading files that have this does work, but saving does not, since this was not implemented in libsbml. I will open an issue in the libsbml repo addressing this.

I opened an issue in libsbml and got a response (sbmlteam/libsbml#429).

Cobrapy will now output SBML L3 v2 Core + FBC3. However, the FBC3 extension spec was written for SBML L3 v1. I'm not sure whether this is an issue.

Looking at this issue in the libsbml repo, it should not be a problem: sbmlteam/libsbml#360. At least from a technical point of view, mixing FBC3 with L3 v2 Core should be fine.

I should add some examples to the docstrings. For now, I mainly updated other parts of the docstring and added examples to the jupyter notebook, but some examples should also be given in the docstrings of methods.

Added some, will add more.

Comparing resources is not really robust, since different variants of an identifier URI could point to the same resource, but they are not considered identical (e.g. http:// vs https://). We could probably improve the logic there.

I took a look at the Resource/identifiers.org logic and found out that the problem is slightly more complex. There are two types of URLs: an old variant, http(s)://identifiers.org/<namespace>/<identifier>, and a new (compact identifier) variant, http(s)://identifiers.org/(<provider>/)<compact identifier>. Here, <compact identifier> is typically of the format <namespace>:<identifier> (where <namespace> does not have to be lower case). So matching the format http(s)://identifiers.org/(<provider>/)<namespace>:<identifier> will mostly give you the correct namespace and identifier, after converting the namespace to lower case. However, there are some exceptions where the namespace is integrated in the identifier, e.g. for ChEBI. In those cases, a URL would look like https://identifiers.org/CHEBI:36927 and the corresponding namespace and identifier are chebi and CHEBI:36927, not just 36927. I created a script in the scripts directory to pull the namespace data from the identifiers.org API and extract cases like this. I hardcoded the ones I found into the resources.py file.

Another issue with the two types of URLs is that their patterns overlap. An identifier can contain a colon (:), which is the case for biocyc. The biocyc identifiers.org URL https://identifiers.org/biocyc/ECOLI:CYT-D-UBIOX-CPLX is meant to be an old URL variant (and is resolved like that), but it also matches the new URL pattern, where the provider would be biocyc, the namespace ecoli and the identifier CYT-D-UBIOX-CPLX. It's hard to correctly interpret these kinds of URLs without knowing all possible namespaces and providers. To resolve this, I also hardcoded a list of namespaces that can have identifiers that include a colon. This is then used to (probably in most cases) choose the right URL variant.

There are two instances where the compact identifier does not 1:1 match the <namespace>:<identifier> pattern (oma.hog and eo). However, both instances are actually broken on identifiers.org (clicking the sample URL gives an error or the service is down).

But all in all, the new identifiers.org interpretation function should handle almost all cases correctly. In other cases, you can always set Resource(..., strict=False). Resource equality is now also based on the namespace+identifier combination, instead of the URI. So different URL variants of the same entity are now considered equal.
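
The interpretation rules described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the code in resources.py: parse_identifiers_org, ToyResource and the two lookup sets are hypothetical names, and the sets contain only the examples mentioned in this comment rather than the full hardcoded lists.

```python
from urllib.parse import urlparse

# Illustrative subsets only (assumption): the real lists are much longer.
PREFIX_IN_IDENTIFIER = {"chebi"}  # identifier keeps its "CHEBI:" prefix
COLON_IN_IDENTIFIER = {"biocyc"}  # old-style identifiers may contain ":"


def parse_identifiers_org(uri):
    """Return (namespace, identifier) for an identifiers.org URI."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("http", "https") or parsed.netloc != "identifiers.org":
        raise ValueError(f"not an identifiers.org URI: {uri}")
    path = parsed.path.lstrip("/")
    first, _, rest = path.partition("/")
    if ":" in first:
        # New variant without provider: <namespace>:<identifier>
        compact = path
    elif rest and (first.lower() in COLON_IN_IDENTIFIER or ":" not in rest):
        # Old variant: <namespace>/<identifier>
        return first.lower(), rest
    else:
        # New variant with provider: <provider>/<namespace>:<identifier>
        compact = rest
    namespace, sep, identifier = compact.partition(":")
    if not sep:
        raise ValueError(f"cannot interpret: {uri}")
    namespace = namespace.lower()
    if namespace in PREFIX_IN_IDENTIFIER:
        identifier = compact  # e.g. keep "CHEBI:36927" as the identifier
    return namespace, identifier


class ToyResource:
    """Equality and hashing use namespace + identifier, not the raw URI."""

    def __init__(self, uri):
        self.uri = uri
        self.namespace, self.identifier = parse_identifiers_org(uri)

    def __eq__(self, other):
        if not isinstance(other, ToyResource):
            return NotImplemented
        return (self.namespace, self.identifier) == (
            other.namespace,
            other.identifier,
        )

    def __hash__(self):
        return hash((self.namespace, self.identifier))


old_style = ToyResource("http://identifiers.org/chebi/CHEBI:36927")
new_style = ToyResource("https://identifiers.org/CHEBI:36927")
same_entity = old_style == new_style
```

Keeping __hash__ consistent with __eq__ matters here, because otherwise two URL variants of the same entity could both end up in one set or as separate dict keys.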

I see in the CI pipeline that python 3.8 has 1 more set of tests failing, so I'll fix those.

This is fixed now.

pascalaldo added 3 commits May 5, 2025 10:51
…s not being written to SBML. Updated tests. Added xfail test for nested kvps.
…n adapted accordingly and some tests are marked for skipping when run with python 3.8.
@pascalaldo
Author

The latest commits addressed the following issues:

  • A new version of libsbml was released, which implements nested annotations for key-value pairs (CustomAnnotations). This solves the first issue mentioned in the initial comment of this pull request. There are no prebuilt files for python 3.8, so I made 3.8 use an older libsbml version and marked the failing test for skipping for python 3.8. All tests should now pass.
  • I ran the benchmarks and improved the performance of the metadata implementation. One of the main improvements is that, by default, the metadata attribute of objects is None. The getter handles creating a new instance when needed. The same is done for metadata.standardized and metadata.custom. Some methods have been adapted to check for _metadata is None to avoid creating a lot of unnecessary empty instances. The other optimization is that the different metadata classes have their own implementations of __deepcopy__.
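
The two optimizations above (creating metadata lazily and providing a custom __deepcopy__) follow a common pattern, sketched here with toy classes. ToyObject, ToyMetadata and has_metadata are hypothetical names for illustration, and the copy logic assumes flat annotation dicts, which is simpler than the real classes.

```python
import copy


class ToyMetadata:
    """Hypothetical metadata container with a hand-written deep copy."""

    def __init__(self):
        self.standardized = []  # list of flat dicts in this sketch
        self.custom = {}        # plain key-value pairs

    def __deepcopy__(self, memo):
        # Copy exactly what is known to be there instead of letting the
        # generic deepcopy machinery walk every attribute.
        new = ToyMetadata()
        new.standardized = [dict(a) for a in self.standardized]
        new.custom = dict(self.custom)
        memo[id(self)] = new
        return new


class ToyObject:
    """Hypothetical object whose metadata is only created on first access."""

    def __init__(self, id):
        self.id = id
        self._metadata = None  # nothing allocated until it is needed

    @property
    def metadata(self):
        if self._metadata is None:
            self._metadata = ToyMetadata()
        return self._metadata

    def has_metadata(self):
        # Checks _metadata directly so that asking does not create an instance.
        return self._metadata is not None and bool(
            self._metadata.standardized or self._metadata.custom
        )


obj = ToyObject("glc__D_c")
empty_before = obj.has_metadata()       # does not allocate a ToyMetadata
obj.metadata.custom["source"] = "demo"  # first access creates the instance
clone = copy.deepcopy(obj)
clone.metadata.custom["source"] = "changed"
```

Objects that never had their metadata touched stay at None through a deep copy, so models with few annotations pay very little for the extra attribute.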

The new metadata implementation still makes cobrapy slower, which is expected, since extra complexity is added to each object. The comparison between the devel branch and the metadata_fbc3 branch is as follows (100 repeats of each benchmark):

                                                 test      time [ms] benchmark_devel  time [ms] benchmark_metadata_fbc3  fraction
11      test_change_objective_benchmark[optlang-glpk]                       0.368428                           0.337824  0.916934
12      test_single_gene_deletion_fba_benchmark[glpk]                     155.484160                         143.340742  0.921899
13  test_single_gene_deletion_linear_moma_benchmar...                     340.725976                         324.400944  0.952088
18                          test_pfba_benchmark[glpk]                     396.682740                         377.728513  0.952218
21                test_double_gene_deletion_benchmark                     304.624596                         291.659847  0.957440
0                 test_add_metabolite_benchmark[glpk]                      89.805612                          86.212051  0.959985
5          test_subtract_metabolite_benchmark[gurobi]                       0.004246                           0.004109  0.967695
23                     test_loopless_benchmark_before                       8.797361                           8.534670  0.970140
19              test_flux_variability_benchmark[glpk]                     533.597054                         517.876455  0.970538
26                  test_minimal_medium_mip_benchmark                      11.156804                          10.839114  0.971525
2               test_add_metabolite_benchmark[gurobi]                      90.846794                          88.270271  0.971639
15      test_single_reaction_deletion_benchmark[glpk]                      38.700004                          37.846041  0.977934
30                        test_optgp_sample_benchmark                       0.640308                           0.632522  0.987840
14  test_single_gene_deletion_linear_room_benchmar...                     440.968414                         442.025780  1.002398
22            test_double_reaction_deletion_benchmark                     451.311932                         454.369495  1.006775
1                test_add_metabolite_benchmark[cplex]                      87.997816                          89.049236  1.011948
17                 test_geometric_fba_benchmark[glpk]                     623.657410                         633.473743  1.015740
6                        test_gpr_symbolism_benchmark                     235.993363                         239.716099  1.015775
28                         test_achr_sample_benchmark                       0.457210                           0.464668  1.016313
24                      test_loopless_benchmark_after                       2.262236                           2.328518  1.029299
7                         test_gpr_equality_benchmark                      34.251253                          35.798897  1.045185
25               test_minimal_medium_linear_benchmark                      11.135063                          11.647122  1.045986
20     test_flux_variability_loopless_benchmark[glpk]                     244.691224                         265.261246  1.084065
3            test_subtract_metabolite_benchmark[glpk]                       0.004014                           0.004369  1.088558
29                          test_optgp_init_benchmark                     150.744128                         192.586441  1.277572
27                           test_achr_init_benchmark                     152.356126                         200.047689  1.313027
16                        test_fastcc_benchmark[glpk]                      99.535172                         135.695934  1.363296
4           test_subtract_metabolite_benchmark[cplex]                       0.004563                           0.007871  1.724809
10                            test_deepcopy_benchmark                      33.165371                          70.638800  2.129896
9       test_copy_benchmark_large_model[optlang-glpk]                     599.438748                        1366.276535  2.279260
8                   test_copy_benchmark[optlang-glpk]                      22.231131                          55.543810  2.498470

The worst performance hit still happens in the copying-related benchmarks; maybe that can be optimized further. The highest increase in runtime happens for test_copy_benchmark, which became ~2.5x slower (down from ~3.5x before optimizations), would this be considered acceptable?

pascalaldo added 2 commits May 5, 2025 20:23
…overhead. Metadata should not have any recurring (identical) objects, so those checks are not necessary.
@cdiener
Member

cdiener commented May 6, 2025

This is shaping up very nicely. Thanks!

The highest increase in runtime happens for test_copy_benchmark, which became ~2.5x slower (down from ~3.5x before optimizations), would this be considered acceptable?

Not an issue from my side. It's not a massive change and we are talking about a fairly fast operation already (~1s). Does it affect SBML parsing as well? I think we don't have a benchmark for this but it might be good to know how long it takes to read Recon3 for example with the old and new version.

The review might take a while because so many files got changed, but the following would help me at least:

  1. Could you maybe comment with a small example on how to add custom and standardized annotations to a new Model and roundtrip them? We would need this for the docs anyway and it would help to understand the API.
  2. How does backward compatibility look right now? It looks like the changes are fully compatible but you would probably know better.

I am personally really excited for the custom annotations because that would really fix some pain points we have in MICOM.

@pascalaldo
Author

@cdiener Thank you, great to hear!

Does it affect SBML parsing as well? I think we don't have a benchmark for this but it might be good to know how long it takes to read Recon3 for example with the old and new version.

Yes, it will. I would estimate that for files with a lot of annotations it will be in the same range (let's say >2x slower). I will look at adding a benchmark for this.

Optionally, we could add a keyword argument (e.g. load_metadata) to the load/read functions (e.g. read_sbml_model) to optionally disable loading metadata. That could be useful in some pipelines etc. when you really just care about performance.

The review might take a while because so many files got changed

Yeah, that's understandable.

Could you maybe comment with a small example on how to add custom and standardized annotations to a new Model and roundtrip them? We would need this for the docs anyway and it would help to understand the API.

I added a jupyter notebook with an example workflow to the documentation, so that this is a chapter/page of the docs. I will look at that again and provide some more scripts so that you can play around with the new metadata system.

How does backward compatibility look right now? It looks like the changes are fully compatible but you would probably know better.

It should be 99% backwards compatible (as in: old scripts should almost always still work). The annotation attribute that used to be a dict, still behaves like one. It inherits from MutableMapping, so any explicit check for type or the use of isinstance(obj.annotation, dict) will fail (you would need to use isinstance(obj.annotation, collections.abc.Mapping) instead, which is True for the new class and actual dictionaries). Another case that may not work is directly writing the annotation attribute to json (we could implement this by adding a toJSON method, if you think that would be useful). All typical dict methods are implemented though, so most usage should just work. I did a github search for cobra + the use of the annotation attribute and all the uses I found should keep on working. One other project used annotation.copy() and expected the result to be an actual dictionary, so I did implement it like that (i.e. SimplifiedAnnotationInterface.copy() returns a dictionary, not a SimplifiedAnnotationInterface instance).
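
The compatibility behaviour described above can be demonstrated with a small dict-like class. This is a sketch under assumptions, not the SimplifiedAnnotationInterface itself: ToyAnnotationView and its backing list of (namespace, identifier) pairs are hypothetical, chosen only to show the isinstance and copy() behaviour.

```python
from collections.abc import Mapping, MutableMapping


class ToyAnnotationView(MutableMapping):
    """Dict-like view over a shared list of (namespace, identifier) pairs."""

    def __init__(self, pairs):
        self._pairs = pairs  # shared backing store, not copied

    def __getitem__(self, namespace):
        found = [ident for ns, ident in self._pairs if ns == namespace]
        if not found:
            raise KeyError(namespace)
        return found

    def __setitem__(self, namespace, identifiers):
        if isinstance(identifiers, str):
            identifiers = [identifiers]
        # Replace all existing entries for this namespace in place.
        self._pairs[:] = [p for p in self._pairs if p[0] != namespace]
        self._pairs.extend((namespace, ident) for ident in identifiers)

    def __delitem__(self, namespace):
        if not any(ns == namespace for ns, _ in self._pairs):
            raise KeyError(namespace)
        self._pairs[:] = [p for p in self._pairs if p[0] != namespace]

    def __iter__(self):
        # dict.fromkeys keeps first-seen order and drops duplicates.
        return iter(dict.fromkeys(ns for ns, _ in self._pairs))

    def __len__(self):
        return len({ns for ns, _ in self._pairs})

    def copy(self):
        return dict(self)  # a plain dict, detached from the backing store


backing = [("chebi", "CHEBI:17234")]
view = ToyAnnotationView(backing)
view["kegg.compound"] = "C00031"  # writes through to `backing`

is_dict = isinstance(view, dict)        # False: explicit dict checks fail
is_mapping = isinstance(view, Mapping)  # True: the abstract check passes
snapshot = view.copy()                  # a real dict
```

Inheriting from MutableMapping supplies get, items, update, pop and similar methods for free once the five abstract methods are defined, which is why most existing dict-style code keeps working against such a class.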

The new save formats (SBML l3 v2 core + fbc3 and json schema 2) can of course not be read reliably by older cobrapy versions.

I am personally really excited for the custom annotations because that would really fix some pain points we have in MICOM.

Great!

I am still a proponent of renaming StandardizedAnnotation and CustomAnnotation to something else, preferably just one word, just for their ease of use.

I saw you ran the CI/CD again and it failed for some environments. The failing tests are all tests/test_io/test_web/test_load.py::test_remote_load[BIOMD0000000633-50-35], which tries to connect to the biomodels server. This test intermittently fails, probably due to something out of our control (network/server issues). It does not seem to have anything to do with the code changes, it just happens on some days.


5 participants