Feature non linear time averaged msd #5066

gitsirsha · 2025-06-15T18:16:34Z

Changes made in this Pull Request:

msd.py script was modified to support time-averaged mean square displacement calculation for dump files with non-linear time spacings.

PR Checklist

Issue raised/referenced?
Yes Issue Support for MDAnalysis packages dealing with non-linear time dump #5028
Tests updated/added?
Not Yet
Documentation updated/added?
Updated short documentation at the top of the msd.py as comments
package/CHANGELOG file updated?
Not Yet
Is your name in package/AUTHORS? (If it is not, add it!)
Yes - Added

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.

📚 Documentation preview 📚: https://mdanalysis--5066.org.readthedocs.build/en/5066/

…ar dump

github-actions

Hello there first time contributor! Welcome to the MDAnalysis community! We ask that all contributors abide by our Code of Conduct and that first time contributors introduce themselves on GitHub Discussions so we can get to know you. You can learn more about participating here. Please also add yourself to package/AUTHORS as part of this PR.

codecov · 2025-06-15T18:30:58Z

Codecov Report

Attention: Patch coverage is 16.66667% with 25 lines in your changes missing coverage. Please review.

Project coverage is 93.52%. Comparing base (0dc5a1b) to head (a7938f9).
Report is 3 commits behind head on develop.

Files with missing lines	Patch %	Lines
package/MDAnalysis/analysis/msd.py	16.66%	24 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #5066      +/-   ##
===========================================
- Coverage    93.62%   93.52%   -0.11%     
===========================================
  Files          177      177              
  Lines        22001    22029      +28     
  Branches      3114     3119       +5     
===========================================
+ Hits         20599    20602       +3     
- Misses         947      971      +24     
- Partials       455      456       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

mrshirts

Added some comments on the code.

mrshirts · 2025-06-18T17:49:49Z

package/MDAnalysis/analysis/msd.py

+#   Modified by Sirsha Ganguly
+#
+# Modification:
+# '_conclude_modified()' is the new function added to calculate time-averaged mean squared displacement of non-linear dumps


Suggest something more descriptive here - _conclude_nonlinear() instead of _conclude_modified()

mrshirts · 2025-06-18T17:59:12Z

package/MDAnalysis/analysis/msd.py

+        if len(sequence) < 2:
+            return True
+        step = sequence[1] - sequence[0]
+        return all(sequence[i+1] - sequence[i] == step for i in range(len(sequence) - 1))


So right now, it's failing the test. My guess is that it's going down the wrong test path, because it's interpreting whatever data is being passed in by the test as not actually being linear (maybe something like 1.0000 gets interpreted at 0.999999). A couple of options here:

Add a keyword like nonlinear that controls the logic, rather than determining the logic from the steps.

Using something like assert_almost_equal so floating point errors don't cause the logic to give the wrong answer (i.e. that they are not equally spaced when they actually are).

IAlibay

Here are some initial questions before we get into the weeds. I'm keen for us to end up with a single method that works for both use cases if possible.

There are probably things we can do to make that happen, but I first went to get a good idea of where things are at in this current state.

IAlibay · 2025-06-19T19:09:20Z

package/MDAnalysis/analysis/msd.py

@@ -1,3 +1,25 @@
+######## Documentation


It's fine to do this later once code has been settled on, but documents exist as module docstring, or API docstring (see the sections encompassed by """ below).

As Irfan said: The comments will not be in the final PR. The first line of the file is always

# -*- Mode: python; tab-width: 4; indent-tabs-mode:nil; coding:utf-8 -*-

and then comes the standard comment block. The only history information (eg who modified the file or added something) that we keep inside the code is versionchanged/versionadded markup in the doc strings. Everything else is known to be available through the git history and the information in the PR/issue.

Could you please

remove the comments

copy them to the PR information

This will make it a lot easier to discuss the intention of the PR separate from the implementation.

Information on the algorithm (basically what's important for a user to know) will go into the doc string in a Notes section (we follow the numpy doc string standard and see https://userguide.mdanalysis.org/stable/contributing_code.html#guidelines-for-docstrings for MDAnalysis specific guidelines on docs).

IAlibay · 2025-06-19T19:18:54Z

package/MDAnalysis/analysis/msd.py

@@ -1,3 +1,25 @@
+######## Documentation
+#
+# Issue: The default '_conclude_simple()' function previously calculated time-averaged mean squared displacement considering linear timespacing between the frames even when a dump file with non-linear timestep gap is provided


It would be better if these details were in the PR info instead - this is where a reviewer usually looks for them (and ultimately we will likely merge without this).

IAlibay · 2025-06-19T19:22:21Z

package/MDAnalysis/analysis/msd.py

+# Modification:
+# '_conclude_modified()' is the new function added to calculate time-averaged mean squared displacement of non-linear dumps
+    # Note: This new function is generalized for non-linear dumps.
+    #       Moreover, if you use this function to calculate for linear timedump this gives same result as '_conclude_simple()' which is implemented in the default version.


Thanks for checking this, it's really what we need to discuss the most here. Ultimately I think the end target is this PR will be to end up with just one method. Having alternate code paths adds to user difficulty and maintenance costs.

Just taking things as they are now (before we get into the weeds).Could you comment on performance? Do the two methods run at similar speeds with the same inputs?

What happens if someone uses the fft based method instead with non linear times? Does that fail?

orbeckst

Thank you for the PR. That's a good start for a discussion.

Initially it would really help to have the big picture up front, namely a description at a high level what you want to accomplish and how.

Additionally

Describe the algorithm that you're implementing.
How does it differ from the existing one (without FFT)?
Can you provide a rough performance comparison between "linear" and "nonlinear"? (I understand that applying the algorithm for equally spaced frames to a trajectory with variably spaced frames is incorrect but I want to get a sense of the necessary added computational complexity.)

I added other comments throughout but initially the big picture is more important because we are trying to figure out if there are ways to re-use code: the fewer code is duplicated, the lower the maintenance burden that we incur, and the easier it is to test and maintain correctness.

orbeckst · 2025-06-26T22:13:12Z

package/MDAnalysis/analysis/msd.py

@@ -1,3 +1,25 @@
+######## Documentation


As Irfan said: The comments will not be in the final PR. The first line of the file is always

# -*- Mode: python; tab-width: 4; indent-tabs-mode:nil; coding:utf-8 -*-

and then comes the standard comment block. The only history information (eg who modified the file or added something) that we keep inside the code is versionchanged/versionadded markup in the doc strings. Everything else is known to be available through the git history and the information in the PR/issue.

Could you please

remove the comments

copy them to the PR information

This will make it a lot easier to discuss the intention of the PR separate from the implementation.

Information on the algorithm (basically what's important for a user to know) will go into the doc string in a Notes section (we follow the numpy doc string standard and see https://userguide.mdanalysis.org/stable/contributing_code.html#guidelines-for-docstrings for MDAnalysis specific guidelines on docs).

orbeckst · 2025-06-26T22:14:05Z

package/MDAnalysis/analysis/msd.py

 class EinsteinMSD(AnalysisBase):
+
+    @staticmethod
+    def is_linear(sequence): # This function is to check if the timesteps in the array is linear or not! If it's linear default MDA code will carry-out other wise the modified code carries!


Instead of comments, write doc strings.

orbeckst · 2025-06-26T22:16:13Z

package/MDAnalysis/analysis/msd.py

+
+
+    def _conclude_modified(self):
+        from collections import defaultdict


imports at top of file,

import collections

then use collections.defaultdict

orbeckst · 2025-06-26T22:16:50Z

package/MDAnalysis/analysis/msd.py

+
+    def _conclude_modified(self):
+        from collections import defaultdict
+        print("The Dump file has non-linear time spacing")


Eventually, remove all print – if you want to show status messages, use the logger.

orbeckst · 2025-06-26T22:17:50Z

package/MDAnalysis/analysis/msd.py

+        from collections import defaultdict
+        print("The Dump file has non-linear time spacing")
+
+        dump_times = [ts.time for ts in self._trajectory]


This iterates through the whole trajectory and may already be expensive!

Maybe unavoidable but keep it in mind.

orbeckst · 2025-06-26T22:18:47Z

package/MDAnalysis/analysis/msd.py

+        n_frames = len(dump_times)
+        n_atoms = self.n_particles
+        positions = np.zeros((n_frames, n_atoms, 3))
+        positions = self._position_array.astype(np.float64)


This overwrites positions. Delete the line above?? Not sure what the intent was here.

gitsirsha · 2025-06-30T22:36:21Z

Thank you @IAlibay @orbeckst and @mrshirts for your time to review the initial PR. Based on your suggestions I have modified a few things:

Added a keyword argument non_linear instead of a check like the initial commit is_linear.
non_linear is set to False and only set True by the user to call the new method _conclude_non_linear .
This is done to check if the GH Actions CI tests get cleared or not as the code executes to the default route.

The new method only executes when fft=False and non_linear = True
Removed the comments for documentation at the top. I will update that in the doc strings later once we are satisfied with the code.
Removed the print statements.
import collections at top of the file.
deleted positions = np.zeros((n_frames, n_atoms, 3)) as it was not necessary.

Necessity of a new method :

The _conclude_simple method calculated the time-averaged msd relying on the number of frames a particular dump file has but not relying on at what timestep did each frame got dumped at. So, a better method is needed to calculate "time-averaging" not "frame-averaging".
Moreover If the new method _conclude_non_linear is used on a linearly dumped file it generates same msd values as the default _conclude_simple method generates. This is because the new method solely uses the timestamps each frame is at which for a linearly dumped file is nothing but the frame index!.

Algorithm of _conclude_non_linear to calculate time-averaged msd :

It creates a dictionary/hash-map with the keys being the time difference between the frames Delta_t and values being a list containing the frame to frame MSD values [msd1, msd2, .....] for that particular Delta t Dictionary to collect MSDs: {Δt: [msd1, msd2, ...]}.
The reference frame is changed with each iteration and each time a new Key (i.e, Delta t) is found it is appended in the dictionary/hash-map.
If a duplicate Delta_t is found it is added to the value (msd list) of that specefic Key in the dictionary/hash-map.
Lastly in the dictionary/hash-map for each Delta_t the msd values inside the list are averaged.

Example Performance comparison:

A coarse grained LAMMPS trajectory with N=60,000 and 21 frames (non-linearly spaced timesteps) is run with fft turned off:

It took around 2.9s with the new method (toggled on using non_linear = True).
and took around 1.6s with the default method.

(This is not including the plotting time in the user interface)

mrshirts · 2025-06-30T22:50:20Z

The big picture here is that for polymer MSD's, the results are highly nonlinear in the amount of time (unlike with liquids, where the MSD becomes linear very quickly). In fact, generally they have a power law dependence. To calculate MSD at long log times ends up requiring a huge amount of data; you get the short time behavior very accurately, and long time behavior less accurately, but you don't really need the short time behavior more accurately than the long time behavior.

So LAMMPS has a standard option to output the trajectory files where every N steps, you output steps logarithmically: perhaps with spacing (roughly): 1, 10, 100, 1000. So for every N steps, you are only outputting 4 points. Then maybe your simulation is 100/N steps long, so ,you will get 100 data points for lags of 1, 10, 100, 1000, etc, and you get equal accuracy for all lags, at a cost of some precision at short times, but using way less storage space (polymer simulations can have a lot of particles!). Polymer scientists use this a lot.

The issue is that the standard code can't handle this, and I'm actually not sure if it can be handled with FFTs. So I'm not totally sure if it makes sense to have a single code path (although that would be desirable). What one should have is that one gets the same answer for LINEAR data (within machine precision) if it goes through either path, which can be automatically checked, so at least any code duplication is validated.

gitsirsha added 3 commits June 13, 2025 17:43

Inditial addition of code to calculate time-averaged msd for non-line…

045f91f

…ar dump

updated_documentation_v1_plots_removed

3462365

Added_name_in_the_package/AUTHORS

f6e5ddf

github-actions bot reviewed Jun 15, 2025

View reviewed changes

mrshirts reviewed Jun 18, 2025

View reviewed changes

IAlibay reviewed Jun 19, 2025

View reviewed changes

orbeckst requested changes Jun 26, 2025

View reviewed changes

addressed_review_comments_mainly_removed_linearity_check

a7938f9



		def _conclude_modified(self):
		from collections import defaultdict

Feature non linear time averaged msd #5066

Are you sure you want to change the base?

Feature non linear time averaged msd #5066

Uh oh!

Conversation

gitsirsha commented Jun 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Checklist

Developers Certificate of Origin

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Jun 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mrshirts left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

IAlibay left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

orbeckst left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gitsirsha commented Jun 30, 2025

Uh oh!

mrshirts commented Jun 30, 2025

Uh oh!

Uh oh!

gitsirsha commented Jun 15, 2025 •

edited

Loading

codecov bot commented Jun 15, 2025 •

edited

Loading