Unsuccessful read with SS-SUCCESS exception using Tree.getNode(...).getData() in Python #2726

Open
alkhwarizmi opened this issue Mar 8, 2024 · 31 comments
Assignees
Labels
api/python Relates to the Python API branch/alpha This is present on or relates to the alpha branch branch/stable This is present on or relates to the stable branch bug An unexpected problem or unintended behavior core Relates to the core libraries and scripts

Comments

@alkhwarizmi
Contributor

Dear all,

I don't feel comfortable debugging the following puzzling condition that we recently discovered at our site. The problem is not blocking so far: for some unknown reason our control software (mainly written in LabVIEW) keeps going. However, I find this quite alarming.

I need some help to debug and fix it, if possible.

Thank you very much in advance.

Sincerely,

Affiliation
Eurac Research, Institute for Renewable Energy, Heat Pumps and Energy Exchange Laboratories

Version(s) Affected
Client side: 'stable_release_7.132.0' (my PC), 'alpha_release_7.139.8' (my colleague's PC)
Server Side: 7.50.1 (Windows), 7.132.0 (GNU/Linux)

Platform
Client side Windows, server side Windows and GNU/Linux

Describe the bug
We recently discovered that access to some past run data seems to be broken.

To Reproduce
Steps to reproduce the behavior:

From my colleague's PC towards the Windows server:

tree = Tree('hplabon', 1934)
 
n = tree.getNode('.CONFIG.WAVEFORMS.OUTPUTS.WFRM00:SIG')
 
n
Out[21]: .CONFIG.WAVEFORMS.OUTPUTS.WFRM00:SIG
 
n.getData()
Traceback (most recent call last):
 
  Cell In[22], line 1
    n.getData()
 
  File C:\MDSplus\python\MDSplus\tree.py:1946 in getRecord
    raise _exc.MDSplusException(status)
 
SsSUCCESS: %SS-W-SUCCESS, Success

Today from my PC towards Windows server, same run number:

from MDSplus import Tree
Issues loading MdsShr, trying find_library

tree = Tree('hplabon', 1934)

n = tree.getNode('.CONFIG.WAVEFORMS.OUTPUTS.WFRM00:SIG')

d = n.getData()
Traceback (most recent call last):

  Cell In[4], line 1
    d = n.getData()

  File C:\MDSplus\python\MDSplus\tree.py:1946 in getRecord
    raise _exc.MDSplusException(status)

TreeNCIREAD: %TREE-E-NCIREAD, Error reading node characteristics from file.

Another example also tested with jTraverser, same run number different node (WFRM00 is broken also in the jTraverser):

from MDSplus import Tree
Issues loading MdsShr, trying find_library

Error in GetAnswerInfoTS: mode = 4, status = 0

tree = Tree('hplabon', 1934)

n = tree.getNode('.CONFIG.WAVEFORMS.OUTPUTS.WFRM09:SIG')

d = n.getData()
Traceback (most recent call last):

  Cell In[4], line 1
    d = n.getData()

  File C:\MDSplus\python\MDSplus\tree.py:1946 in getRecord
    raise _exc.MDSplusException(status)

SsSUCCESS: %SS-W-SUCCESS, Success


d
Traceback (most recent call last):

  Cell In[5], line 1
    d

NameError: name 'd' is not defined

Today from my PC towards GNU/Linux server, which is OK (yesterday it was not, if I am not mistaken):

from MDSplus import Tree
Issues loading MdsShr, trying find_library

tree = Tree('hplab', 2104)

n = tree.getNode('.CONFIG.WAVEFORMS.OUTPUTS.WFRM00:SIG')

d = n.getData()

d.getData()
Out[5]: [0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.]

d
Out[6]: Build_Signal(0, [0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,1.,1.,0.], [], [0Q,1000000Q,1000000Q,1080000Q,1080000Q,1300000Q,1300000Q,1634000Q,1634000Q,2800000Q,2800000Q,2880000Q,2880000Q,4660000Q,4660000Q,4740000Q,4740000Q,5500000Q,5500000Q,5580000Q,5580000Q,6400000Q,6400000Q,6480000Q,6480000Q,7300000Q,7300000Q,7380000Q,7380000Q,8200000Q,8200000Q,8280000Q,8280000Q,10000000Q,10000000Q,10080000Q,10080000Q,13600010Q,13600010Q,13650010Q,13650010Q,17200000Q,17200000Q,17280000Q,17280000Q,18100000Q,18100000Q,18180000Q,18180000Q,21700020Q,21700020Q,21790020Q,21790020Q,28000020Q,28000020Q,28080020Q,28080020Q,31600020Q,31600020Q,31680020Q,31680020Q,35200000Q,35200000Q,35280000Q,35280000Q,40600000Q,40600000Q,40680000Q,40680000Q,41500040Q,41500040Q,41550040Q,41550040Q,42400040Q,42400040Q,42450040Q,42450040Q,44200040Q,44200040Q,44280040Q,44280040Q,49600040Q,49600040Q,49811040Q,49811040Q,52300040Q,52300040Q,52380040Q,52380040Q,53200040Q,53200040Q,53534040Q,53534040Q])

Expected behavior
Like the third example in all conditions.

Screenshots
Access via think client (new jTraverser seems to provide different results)
image

image

Additional context

  1. MDS+ files are moved from the online server (Windows) to the offline server (GNU/Linux) on a weekly basis
  2. The first time I load MDSplus in Python I get a message (see above)
  3. We need to work with both the 64-bit and 32-bit versions at the same time (due to linking against the LabVIEW executable)
@alkhwarizmi alkhwarizmi added the bug An unexpected problem or unintended behavior label Mar 8, 2024
@mwinkel-dev mwinkel-dev added branch/stable This is present on or relates to the stable branch api/python Relates to the Python API branch/alpha This is present on or relates to the alpha branch core Relates to the core libraries and scripts labels Mar 9, 2024
@mwinkel-dev
Contributor

Hi @alkhwarizmi -- Thanks for the detailed bug report. Would appreciate a few more details, as that will aid us in reproducing and debugging the issue. Specifically would like to know the type (Windows, Ubuntu Linux, RedHat Linux, etc.) and version of the operating system on the two client PCs. And same thing regarding the two servers.

Also note that the SsSUCCESS: %SS-W-SUCCESS, Success message is misleading. The W means "warning", and thus this is actually an error condition.
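
As a rough illustration of why a status named "SUCCESS" is still raised as an exception: MDSplus status codes follow the VMS condition-value layout, in which the low three bits hold the severity and only odd values count as success. The sketch below decodes that layout; the helper name `describe_status` is hypothetical and is not part of the MDSplus API.

```python
# Severity codes stored in the low three bits of a VMS-style condition value.
SEVERITY = {
    0: "W (warning)",
    1: "S (success)",
    2: "E (error)",
    3: "I (informational)",
    4: "F (fatal)",
}


def describe_status(status):
    """Return (severity_label, is_success) for a VMS-style status code."""
    severity = status & 0x7
    is_success = bool(status & 1)  # only odd values mean success
    return SEVERITY.get(severity, "reserved"), is_success


# A plain C "return 0" therefore decodes as a warning, i.e. a failure.
print(describe_status(0))
print(describe_status(1))
```

Under this convention a status of 0 has severity W and a clear success bit, which matches the `%SS-W-SUCCESS` text in the tracebacks above.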

@alkhwarizmi
Contributor Author

Thank you very much for the support.

Here is the information you asked for:

Regarding my PC:

{'platform': 'Windows',
'platform-release': '10',
'platform-version': '10.0.19045',
'architecture': 'AMD64',
'hostname': ,
'ip-address': ,
'mac-address': ,
'processor': 'Intel64 Family 6 Model 140 Stepping 1, GenuineIntel',
'ram': '32 GB'}

My colleague's PC:

{'platform': 'Windows',
'platform-release': '10',
'platform-version': '10.0.19045',
'architecture': 'AMD64',
'hostname': ,
'ip-address': ,
'mac-address': ,
'processor': 'Intel64 Family 6 Model 142 Stepping 9, GenuineIntel',
'ram': '8 GB'}

Regarding the GNU/Linux server:
{'platform': 'Linux',
'platform-release': '4.15.0-213-generic',
'platform-version': '#224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023',
'architecture': 'x86_64',
'hostname': ,
'ip-address': ,
'mac-address': ,
'processor': 'x86_64',
'ram': '8 GB'}

Regarding the Windows server:
{'platform': 'Windows',
'platform-release': '10',
'platform-version': '10.0.19041',
'architecture': 'AMD64',
'hostname': ,
'ip-address': ,
'mac-address': ,
'processor': 'Intel64 Family 6 Model 71 Stepping 1, GenuineIntel',
'ram': '8 GB'}

@mwinkel-dev
Contributor

Hi @alkhwarizmi -- Thanks for the information. Based on that, I will use Windows 10 for my testing.

Regarding the Linux server, please login to the computer and at the shell prompt type cat /etc/os-release. That should show the flavor of Linux that is installed on the server (Ubuntu, RedHat, Debian, CentOS, Fedora, whatever). Normally, I use Ubuntu for my testing. But if your Linux system is on RedHat or some other distribution, I will do testing with that distribution instead.
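
Since the other system details in this thread were collected with Python, the same information can also be gathered with a small script. This is only a sketch: it assumes the standard `KEY=VALUE` layout of `/etc/os-release`, the function names are made up for illustration, and the sample text is invented rather than taken from the server in question.

```python
def parse_os_release(text):
    """Parse KEY=VALUE lines in the /etc/os-release format into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"').strip("'")
    return info


def read_os_release(path="/etc/os-release"):
    """Read and parse the os-release file of the local machine."""
    with open(path, encoding="utf-8") as handle:
        return parse_os_release(handle.read())


# Invented sample text, only to show the shape of the result.
sample = 'NAME="Ubuntu"\nVERSION_ID="18.04"\nID=ubuntu\n'
print(parse_os_release(sample))
```

On the Linux server itself, `read_os_release()` would return the real values; the `NAME` and `VERSION_ID` keys are the ones that identify the distribution and release.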

@mwinkel-dev
Contributor

mwinkel-dev commented Mar 11, 2024

Hi @alkhwarizmi -- The initial bug report indicates that this issue only exists with old data. Thus, here are some related questions:

  • When were the old MDSplus trees created (i.e., which year)?
  • What version of MDSplus was used to create those old trees?
  • Are all trees created on the Windows 10 server using 32-bit MDSplus?
  • Do the Windows 10 client PCs use 32-bit or 64-bit MDSplus?
  • Are there any problems accessing new data (e.g., created a week or two ago)?
  • Have there been any recent changes to the servers and client PCs?

It is also interesting that the Windows 10 client PCs have this warning: Issues loading MdsShr, trying find_library.

@mwinkel-dev mwinkel-dev self-assigned this Mar 13, 2024
@mwinkel-dev
Contributor

Hi @alkhwarizmi -- I am configuring some systems so I can investigate this issue. Will post here when I have succeeded in reproducing the error.

@alkhwarizmi
Contributor Author

Hi @alkhwarizmi -- The initial bug report indicates that this issue only exists with old data. Thus, here are some related questions:

  • When were the old MDSplus trees created (i.e., which year)?
  • What version of MDSplus was used to create those old trees?
  • Are all trees created on the Windows 10 server using 32-bit MDSplus?
  • Do the Windows 10 client PCs use 32-bit or 64-bit MDSplus?
  • Are there any problems accessing new data (e.g., created a week or two ago)?
  • Have there been any recent changes to the servers and client PCs?

It is also interesting that the Windows 10 client PCs have this warning: Issues loading MdsShr, trying find_library.

Some comments:

  1. In general we do not update the software very often. Our "clients", meaning the software running on our PCs, are updated more frequently. The "servers" are practically never updated (one is still on 7.50).
  2. The tree is created using the thick client (class Tree) from my PC or my colleague's PC, connecting to the Windows server.
  3. The Windows server is also the Plant Controller, which in the end fills the nodes with data.
  4. We later copy the tree files to cold storage (the GNU/Linux server) without involving the MDS+ libraries.
  5. The run 1934 was created Fri Apr 21 2023 09:33:07 GMT+0000.
  6. We don't record the version of MDS+ used to create the tree. That could be useful information.
  7. What may be relevant is that:
    a. When I create the tree I use 64-bit Python, which should link to the 64-bit version of the library.
    b. My colleague had a lot of problems with earlier versions of the MDS+ installer, which did not handle the two versions correctly. However, our scripts generating the tree are in Python, and Python is 64-bit. So if he ran them, he must have used the 64-bit version, I guess.
    c. The program writing data to the database is coded in LabVIEW, and we are stuck with the 32-bit version for compatibility with the hardware we have. This program should link with the 32-bit version of MDS+.
    d. Even on our PCs we need both the 32-bit and 64-bit versions because we build and run both LabVIEW and Python software.
  8. I tried creating an experiment tree locally on my PC. This does not seem to cause any problem.

@mwinkel-dev
Contributor

Hi @alkhwarizmi -- Thanks for the additional detail.

Current conjecture is that this is a cross-version compatibility issue (i.e., client is newer than the Windows server). Reason for this conjecture is as follows:

  • Trees created on Win10 server with stable_7.50.1 work fine when moved to your Linux archive server (stable_7.132.0) and accessed with a relatively new client (Win10 stable_7.132.0)
  • Access from new Win10 clients (stable_7.132.0 and alpha_7.139.8) throws errors when accessing the Win10 server (stable_7.50.1)
  • And the Win10 server is running an MDSplus version that is ~4 years old (stable_7.50.1 was released around 15-May-2019).

Note that the networking protocol that is part of MDSplus, namely mdsip, does evolve over time. And thus as with any networking protocol, there must be a compatible pair of client and server versions in order for communication to work correctly.

To test the above conjecture, I am setting up a server with stable_7.50.1 and a client with stable_7.132.0 and seeing if I can reproduce the errors described in the bug report. When I am able to reproduce the error, I will then be able to provide the following:

  • the minimum version the Win10 server must be upgraded to in order to be compatible with the new clients, or
  • the minimum version the Win10 clients must be downgraded to in order to be compatible with the ~4 year old Win10 server, or
  • whether there is a fix that can be made to the mdsip protocol to extend the client/server compatibility to cover the versions of MDSplus in use at your site.
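
To make the "client is newer than the server" comparison concrete, here is a minimal sketch that orders the release tags quoted in this thread. The function names are hypothetical, and the comparison only ranks releases; it says nothing about which client/server pairs are actually compatible over mdsip.

```python
import re


def parse_release(tag):
    """Extract (major, minor, patch) from a tag such as 'stable_release_7.132.0'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"no version number found in {tag!r}")
    return tuple(int(part) for part in match.groups())


def client_is_newer(client_tag, server_tag):
    """Compare numerically, so that 7.132.0 ranks above 7.50.1."""
    return parse_release(client_tag) > parse_release(server_tag)


print(client_is_newer("stable_release_7.132.0", "7.50.1"))
print(client_is_newer("alpha_release_7.139.8", "7.132.0"))
```

The numeric tuple comparison matters here: a plain string comparison would rank "7.132.0" below "7.50.1".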

@alkhwarizmi
Contributor Author

Thank you very much for the help.

First a question, then I will tell you what I would do.

Q: This incompatibility concerns how data is transferred, I guess, and not how data is stored, right? So past data will remain readable with a newer client.

If the answer to the above question is yes, then I would not develop any hack to try to maintain compatibility with 7.50. We will try to update the server. The point is that we are very much scared of everything that could go wrong... It is a production system, any delay costs quite a lot, and we need to plan the upgrade in advance. I cannot do it at will.

@mwinkel-dev
Contributor

mwinkel-dev commented Mar 13, 2024

Hi @alkhwarizmi,

You are correct. If the mdsip incompatibility conjecture is correct, then it only affects communication between the client and server (and vice versa). It is probable that the error is being thrown even before the server attempts to return data to the client.

An easy way to determine if your site has encountered an mdsip compatibility issue is to do the following:

  • Find a very old tree on your servers (doesn't matter whether you use the Win10 server or the Linux archive server).
  • Copy that tree to your Win10 client PC.
  • On your Win10 client PC, see if you can open the tree and read the nodes of interest using one of the MDSplus utilities (mdstcl, jTraverser, jTraverser2, jScope, Python API, MATLAB, etcetera).
  • If you can successfully open the tree locally without any error, then surely the issue is a version mismatch on mdsip (clients are newer than the server).
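
The steps above can be sketched in Python roughly as follows. This is a sketch under stated assumptions, not a tested procedure: it assumes each shot is stored as three files named `<tree>_<shot>.tree`, `<tree>_<shot>.characteristics` and `<tree>_<shot>.datafile` (with the shot number padded to at least three digits), and that a `<tree>_path` environment variable without a host prefix selects local access. The helper names are made up for illustration.

```python
import os
import shutil

# Assumed naming convention: <tree>_<shot>.<extension>, three files per shot.
EXTENSIONS = ("tree", "characteristics", "datafile")


def copy_shot_files(src_dir, dst_dir, src_tree, dst_tree, shot):
    """Copy the files of one shot, renaming them to another tree name."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for ext in EXTENSIONS:
        src = os.path.join(src_dir, f"{src_tree}_{shot:03d}.{ext}")
        dst = os.path.join(dst_dir, f"{dst_tree}_{shot:03d}.{ext}")
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied


def read_node_locally(tree_name, shot, node_path, directory):
    """Open the copied tree from a local directory and read one node."""
    # Imported here so that the copy helper also works without MDSplus installed.
    from MDSplus import Tree

    # No "host::" prefix, so the files are read without going through mdsip.
    os.environ[f"{tree_name}_path"] = directory
    return Tree(tree_name, shot).getNode(node_path).getData()
```

If the read succeeds locally but fails against the server, that points at the client/server communication rather than at the stored data. If in doubt, set the path variable in the shell before starting Python instead of from inside the script.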

Many sites that use MDSplus also update their server side software infrequently. And for the same reasons (i.e., don't want to break a working system, the expense of disrupting or delaying scheduled experiments, the effort involved in testing a new version before placing it in production and so forth).

While you are considering whether to upgrade your Win10 server to a newer version, I will continue to investigate the issue. If I am able to reproduce the errors, then you will have the facts needed to proceed with your upgrade plans.

@mwinkel-dev
Contributor

Hi @alkhwarizmi,

More questions . . .

  • What version of Python is running on your Win10 client PC?
  • Are you running Python directly (i.e., not having LabView or any other application run Python for you)?
  • Does your Python program just do a "create pulse" to copy the model file to generate each shot?
  • Or is your Python program actually constructing the tree for each shot (i.e., adding nodes, building the hierarchy)?

For my first test of the issue, everything worked fine. Did not reproduce the errors shown in the bug report.

Here is the configuration that was used:

  • Server: stable_7.50.1 on Ubuntu18
  • Client: stable_7.132.0 on Ubuntu22
  • mdsip: thick-client
  • tree was created on the server using mdstcl
  • tree has two signals, SIG1 (4 integer elements) and SIG2 (1 integer element)
  • Python on client was able to display both SIG1 and SIG2 without problem
  • jTraverser2 on client was also able to display both SIG1 and SIG2 correctly

Will now repeat with a signal that has floats and quads (as per the screenshot in the initial bug report). And if that doesn't reproduce the error, will then switch to a Win10 server with stable_7.50.1 version.

I have also examined the code in the MDSplus API for Python and it definitely is detecting an error.

@mwinkel-dev
Contributor

mwinkel-dev commented Mar 15, 2024

Hi @alkhwarizmi -- Was unable to reproduce the errors by repeating the above experiment using signals that contained float data and quadword dimension. And those signals also displayed fine using jTraverser2 on the client.

@mwinkel-dev
Contributor

mwinkel-dev commented Mar 15, 2024

Hi @alkhwarizmi,

More questions . . .

  • Does the Win10 server have both 32-bit and 64-bit MDSplus installed?
  • Do the two Win10 client PCs also have 32-bit and 64-bit MDSplus?
  • What was the result of taking an old tree, copying it to your client PC and opening it locally?
  • And what happened when copying the old tree to your colleague's client PC and opening it locally?

@mwinkel-dev
Contributor

mwinkel-dev commented Mar 15, 2024

Hi @alkhwarizmi,

It appears that Python for Windows does have a 32-bit version. If so, that would simplify the configuration of your Win10 computers as you would only need the 32-bit version of MDSplus. (Unless your site has a Python program that must use a 64-bit library to analyze data read from MDSplus.)

https://www.python.org/downloads/windows/
https://docs.python.org/3/using/windows.html#
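
A quick way to confirm which kind of Python a given PC is running is to ask the interpreter for its pointer size. This is a minimal sketch using only the standard library; the function name is hypothetical.

```python
import platform
import struct
import sys


def interpreter_bits():
    """Return 32 or 64 depending on the pointer size of this interpreter."""
    return struct.calcsize("P") * 8


print(f"Python {platform.python_version()} is a {interpreter_bits()}-bit build")
print(f"sys.maxsize > 2**32: {sys.maxsize > 2**32}")
```

A 64-bit interpreter can only load 64-bit MDSplus libraries, and a 32-bit interpreter only 32-bit ones, so this check shows which set of DLLs a given Python session will need.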

@mwinkel-dev
Contributor

Hi @alkhwarizmi,

The initial bug report states that jTraverser isn't displaying the expected output (i.e., that it differs from the Python output).

Some questions about that observation . . .

  • Were you comparing jTraverser and Python on the same client PC accessing the same server, tree and node?
  • Is the difference simply because the jTraverser screenshot shows node, WFRM09:SIG, but the Python output (from your PC to the Linux server) is accessing a different node, WFRM00:SIG?

@mwinkel-dev
Contributor

Hi @alkhwarizmi,

Switched to Windows 10 for the server and was unable to reproduce the error.

Here was the configuration:

  • Server: 64-bit Win10 with 64-bit MDSplus stable_7.50.1
  • Client: Ubuntu 22 with MDSplus stable_7.132.0
  • mdsip: thick-client
  • tree was created on the server and had a signal with 4 elements (data = floats, dimension = quadwords)
  • client was able to retrieve the data using mdstcl and Python

Next will configure the Windows server with 32-bit MDSplus.

@zack-vii
Contributor

zack-vii commented Mar 16, 2024

Hi there,
Since there seem to be a lot of variables, is it possible to get access to the tree (subtree) in question? If this issue occurs on multiple trees, maybe you can share the shot with the smallest size. The SS-W-SUCCESS (0) status is not used by MDSplus itself; it comes from regular C code, which tends to return 0 on success. We replaced those in most places, possibly not all, with MDSplusSUCCESS. To find the cause of the different reads, it would be best to understand why different versions take different paths. Once that is understood, one could possibly isolate the issue that may have happened on write.

@zack-vii
Contributor

zack-vii commented Mar 16, 2024

3. The windows server is also the Plant Controller, which in the ends fills the nodes with data.

I would advise against Windows as the final tree host, as its native file systems do not support partial file locking at the system level. This makes it slow and less useful for multithreaded/multiprocess writes. At W7X we ran a lot of tests on how to store data efficiently and found it is best to:

  • have a subtree for each component.
  • write data locally on each component
  • afterwards transfer the entire file set to the archive server
  • have the archive server hold the master tree that links all components
  • set up tree_paths on the server such that it also looks for trees on the component servers, allowing clients to read data during and after a shot with a single path configuration, i.e. 'default_tree_path=mdsplus-server::'.

This will limit the amount of concurrent writes to a file and the number of sources for a write issue.
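
As a rough sketch of the last bullet, a path variable with several search locations could be assembled as shown below. The assumptions are that a tree path is a semicolon-separated list of locations tried in order, and that remote locations use the `host[:port]::directory` form seen elsewhere in this thread. The host name, directories and helper names are hypothetical.

```python
import os


def remote(host, directory="", port=None):
    """Format a location served by mdsip as host[:port]::directory."""
    server = f"{host}:{port}" if port else host
    return f"{server}::{directory}"


def build_tree_path(*locations):
    """Join search locations into one tree-path value, tried in order."""
    return ";".join(location for location in locations if location)


# Hypothetical layout: local archive first, then one component server.
path = build_tree_path("/archive/trees", remote("component-1", "/data/trees"))
os.environ["default_tree_path"] = path
print(path)
```

With such a setup, clients only need a single entry pointing at the archive server, and the server resolves where each subtree actually lives.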

@mwinkel-dev
Contributor

Hi @alkhwarizmi,

My newest conjecture is that this issue might be associated with running both 32-bit and 64-bit MDSplus on the same Windows 10 computer. I will do some experiments with that configuration in the coming week.

To determine if your trees are undamaged, it is probably best to log in to the Linux server and use its MDSplus and Python to check the old trees, as that will eliminate all the variables associated with your site's unusual MDSplus installation on Windows 10.

Hi @zack-vii,

Thanks for the tips. Much appreciated!

@mwinkel-dev
Contributor

Hi @zack-vii,

Regarding the C code, it does indeed still have return 0 statements. A few days ago, I was looking at the source code for the read operations (as shown in the bug report) that threw the SS-W-SUCCESS exception. The return 0 occurs 2 times in treeshr/TreeGetRecord.c and 5 times in treeshr/RemoteAccess.c. (And likely also occurs elsewhere in the treeshr code -- I haven't checked yet.)

This is an interesting bug because of the cross-version aspect (i.e., multiple versions of MDSplus spanning a ~4 year period), plus 32-bit and 64-bit versions installed on the same computers (needed for 32-bit LabVIEW and 64-bit Python).

@mwinkel-dev
Contributor

mwinkel-dev commented Mar 18, 2024

Hi @alkhwarizmi,

I was wrong. It turns out that when installing MDSplus on 64-bit Windows computers, both the 32-bit and 64-bit MDSplus *.dll files are installed. (I was surprised to see that when I examined the configuration script for the installer.) And I have confirmed that for stable-7.50.1, the 32-bit *.dll files end up in C:\Windows\SysWOW64 and the 64-bit *.dll files end up in C:\Windows\System32.

So now, I will see what happens if I use mdsip with the 32-bit *.dll files.

@alkhwarizmi
Contributor Author

Hallo, thank you for the work and sorry for the late reply.

  • What version of Python is running on your Win10 client PC?

I don't think this is relevant, because the real-time controller uses only the 32-bit LabVIEW API. We don't log in there to run the tree-creation scripts (I know it would be faster, but I wanted to avoid problems due to machine size altogether). When we need to update the model tree, we re-create it from our PCs using the Tree class. However, we do have Python installed there; it is quite old (Python 3.6.8 from the Anaconda3 distribution) precisely because we don't use it.

  • Are you running Python directly (i.e., not having LabView or any other application run Python for you)?

When we run the script to re-create the model, we normally run it from a shell within the Spyder IDE.

  • Does your Python program just do a "create pulse" to copy the model file to generate each shot?

The plant control software (PCS) running on the Windows server (which is a National Instruments PXI rack) is the entity that creates new runs from the model when the operator decides to start a new experiment (which I call a "run"). The PCS does this using the 32-bit LabVIEW API for "create pulse".

  • Or is your Python program actually constructing the tree for each shot (i.e., adding nodes, building the hierarchy)?

No, this is done only occasionally, when we need to apply structural changes to the tree, and it is not done by the PCS software itself but by our management scripts, written mostly in Python (we also have LabVIEW "scripts"). As I said above, these are not even run from the Windows server but from our PCs.

Cheers,

@alkhwarizmi
Contributor Author

3. The windows server is also the Plant Controller, which in the ends fills the nodes with data.

I would advise against Windows as the final tree host, as its native file systems do not support partial file locking at the system level. This makes it slow and less useful for multithreaded/multiprocess writes. At W7X we ran a lot of tests on how to store data efficiently and found it is best to:

  • have a subtree for each component.
  • write data locally on each component
  • afterwards transfer the entire file set to the archive server
  • have the archive server hold the master tree that links all components
  • set up tree_paths on the server such that it also looks for trees on the component servers, allowing clients to read data during and after a shot with a single path configuration, i.e. 'default_tree_path=mdsplus-server::'.

This will limit the amount of concurrent writes to a file and the number of sources for a write issue.

Hallo zack,

Thanks for the tip. At the moment, from the point of view of MDS+, we don't have any subsystems, because our distributed I/O does not independently send data to the central node (the Windows server, aka the PXI). Instead, the PCS itself collects data from the actual subsystems and then writes the data into the tree in a very linear for loop. In other words, we have only one writer.

As the system has grown over time and we are starting to experience latency problems, I am curious to know your opinion about parallelizing the MDS+ writes across different threads to make them faster. However, I think that we will still have one writer.

Cheers,

@alkhwarizmi
Contributor Author

Regarding the Python versions:

My PC: 3.11.5
My colleague's PC: 3.8.10
Windows Server (IMHO irrelevant): 3.6.8
GNU/Linux server (IMHO irrelevant): 3.6.9

Cheers,

@mwinkel-dev
Contributor

mwinkel-dev commented Mar 20, 2024

Hi @alkhwarizmi,

Three topics: your old trees, your recent posts, and my continuing experiments.

Old Trees
Have you been able to confirm that your old trees are not damaged? That if you login to the Linux server and use its MDSplus tools / APIs, that you can retrieve all the old data? (I hope that all of your data is intact and nothing has been lost.)

Your Recent Posts
Thanks for the additional information about your site's configuration. Now that I have a better understanding of the workflow, I concur that the Python version on the Windows and Linux server is irrelevant.

My understanding is that the Windows 10 server is only running 32-bit MDSplus, correct?

And that the problems your colleague encountered installing 32-bit and 64-bit MDSplus only applies to the two Win10 client PCs, correct? And that the 32-bit version is from the same MDSplus release as the 64-bit version, correct?

I am curious to know what your colleague did to workaround the installer issue (that was mentioned in your post of 13-Mar-2024). If you use Windows Explorer to search your entire C:\ drive for MdsShr.dll how many copies does it find? And what directories are they in?
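
As an alternative to searching with Windows Explorer, the same question can be answered with a short script that also reports whether each copy is a 32-bit or 64-bit build. This is a sketch that reads the Machine field of the PE header (0x014C for x86, 0x8664 for x64); the function names are made up for illustration.

```python
import os
import struct

MACHINE_NAMES = {0x014C: "32-bit (x86)", 0x8664: "64-bit (x64)"}


def pe_machine(path):
    """Read the Machine field from the PE header of a DLL or EXE."""
    with open(path, "rb") as handle:
        header = handle.read(0x40)
        if len(header) < 0x40 or header[:2] != b"MZ":
            return "not a PE file"
        # Offset 0x3C of the DOS header points at the "PE\0\0" signature.
        (pe_offset,) = struct.unpack_from("<I", header, 0x3C)
        handle.seek(pe_offset)
        signature = handle.read(6)
    if len(signature) < 6 or signature[:4] != b"PE\0\0":
        return "not a PE file"
    (machine,) = struct.unpack_from("<H", signature, 4)
    return MACHINE_NAMES.get(machine, hex(machine))


def find_dlls(root, name="MdsShr.dll"):
    """Yield (path, architecture) for every file called `name` under root."""
    for folder, _dirs, files in os.walk(root):
        for filename in files:
            if filename.lower() == name.lower():
                full = os.path.join(folder, filename)
                yield full, pe_machine(full)


if __name__ == "__main__":
    for dll_path, arch in find_dlls("C:\\"):
        print(arch, dll_path)
```

One caveat: a 32-bit Python process on 64-bit Windows is subject to file system redirection, so it may see the contents of SysWOW64 when it looks in System32. Running the script from a 64-bit interpreter avoids that ambiguity.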

My Experiments
I am still configuring my system to have 64-bit MDSplus on my client communicate via mdsip with 32-bit MDSplus on my server. I will be working on that experiment later today.

@alkhwarizmi
Contributor Author

Hallo,

First, some answers to your last post:

My understanding is that the Windows 10 server is only running 32-bit MDSplus, correct?

yes

And that the problems your colleague encountered installing 32-bit and 64-bit MDSplus only applies to the two Win10 client PCs, correct?

yes

And that the 32-bit version is from the same MDSplus release as the 64-bit version, correct?

yes

I am curious to know what your colleague did to workaround the installer issue (that was mentioned in your post of 13-Mar-2024). If you use Windows Explorer to search your entire C:\ drive for MdsShr.dll how many copies does it find? And what directories are they in?

Ahh, you mean 7.b... I can't answer now. I think he had to change the system path variable every time he switched between Python and LabVIEW. However, I also experienced the problem, and my way to overcome it was to "wait and pick a newer MDS+ installer" in the hope that it would solve the problem, which is in fact what happened. :) So I did not raise anything here, thinking that this problem was already solved (which I think it is).

Instead, regarding the original point (old trees), I was about to confirm things when I stumbled upon a very surprising "feature" that could even point to some unexpected problem. Here is how the story goes.

I copied the 3 old data files referring to run 1934 from the Windows server, placed them on my local test tree and renamed appropriately.
Then I did access the run through the Spyder shell and saw that it was actually working.
Then I started to collect everything in a script, shown below, to properly report in this thread.
The purpose of the script was to show that the remote access raised the exception while the local access actually worked fine.

import os

from MDSplus import Tree


print(f"test_path = {os.environ['test_path']}")


nodePath = '.CONFIG.WAVEFORMS.OUTPUTS.WFRM13:SIG'

print("Testing remote access to hplabon machine.")
print(f"hplabon_path = {os.environ['hplabon_path']}")

tree = Tree('hplabon', 1934) # Remote 

node = tree.getNode(nodePath)

print(f"node = {node}")

try:
    
    data = node.getData(node)
    print(f"data = {data}")
except Exception as e:
    
    print(f"Received exception {e} while accessign node {node} on tree {tree}")
    
finally:
    
    tree.close()
    
print("Testing remote access to hplabon machine.")
print(f"test_path = {os.environ['test_path']}")

tree = Tree('test', 1934) # Local same data files copied and renamed 

node = tree.getNode(nodePath)

print(f"node = {node}")

try:
    
    data = node.getData(node)
    print(f"data = {data}")
except Exception as e:
    
    print(f"Received exception {e} while accessign node {node} on tree {tree}")
    
finally:
    
    tree.close()

To my great surprise, however, the script did actually access all the data correctly, as you can see from the output below.

runfile('C:/Development/PCS/Plant-Control-System/PCS-Core/Python/src/tests/debugMDSIPProblem.py', wdir='C:/Development/PCS/Plant-Control-System/PCS-Core/Python/src/tests')
Issues loading MdsShr, trying find_library
Testing remote access to hplabon machine.
hplabon_path = <hidden>:8000::C:\MDSplusTrees\hplabon
node = \HPLABON::TOP.CONFIG.WAVEFORMS.OUTPUTS.WFRM13:SIG
data = Build_Signal(0, [0.,0.], [], [0Q,3600000Q])
Testing remote access to hplabon machine.
test_path = C:\MDSplusTrees\test
node = \TEST::TOP.CONFIG.WAVEFORMS.OUTPUTS.WFRM13:SIG
data = Build_Signal(0, [0.,0.], [], [0Q,3600000Q])

Later I found out how to reproduce the bug: just run the above commands one by one in the shell!

tree = Tree('hplabon', 1934) # Remote

node = tree.getNode(nodePath)

data = node.getData(node)
Traceback (most recent call last):

  Cell In[4], line 1
    data = node.getData(node)

  File C:\MDSplus\python\MDSplus\tree.py:1946 in getRecord
    raise _exc.MDSplusException(status)

SsSUCCESS: %SS-W-SUCCESS, Success

Moreover, I noticed that sometimes, after a kernel reset, the script behaves unpredictably and generates errors, for example (sometimes there are even more errors):

runfile('C:/Development/PCS/Plant-Control-System/PCS-Core/Python/src/tests/debugMDSIPProblem.py', wdir='C:/Development/PCS/Plant-Control-System/PCS-Core/Python/src/tests')
Issues loading MdsShr, trying find_library
Testing remote access to hplabon machine.
hplabon_path = 10.21.240.242:8000::C:\MDSplusTrees\hplabon
node = \HPLABON::TOP.CONFIG.WAVEFORMS.OUTPUTS.WFRM13:SIG
data = Build_Signal(0, [0.,0.], [], [0Q,3600000Q])
Testing remote access to hplabon machine.
test_path = C:\MDSplusTrees\test
node = \TEST::TOP.CONFIG.WAVEFORMS.OUTPUTS.WFRM13:SIG
data = Build_Signal(0, [0.,0.], [], [0Q,3600000Q])

Error in SendArg: mode = 6, status = 65554
D, 1711098078.714:  buffer_free()                 Connection(id=2, state=0x80, protocol='tcp', info_name='tcp', version=0, user='(null)')

My conclusion is that maybe something in the Python interface is not working as intended, or that network parameters (like timeouts) affect the communication. I suspect a faulty interaction among different multi-threading or multi-processing libraries.

Let me know what you think, and thanks again for the support!

Cheers,

@mwinkel-dev
Contributor

Hi @alkhwarizmi -- Thank you for the additional information! You have found an important clue. I will install the Spyder IDE and see if I can reproduce the problem.

Now for the details . . .

Old Trees
Very glad to read that your old trees are intact. It is excellent news that no data has been lost.

New Conjecture
Newest conjecture is that the Spyder IDE uses threading. Unfortunately, the mdsip protocol (part of MDSplus) supports multi-process access to remote servers, but not multi-threaded access. The mdsip protocol maintains a lot of internal state per connection. It is a known issue that multi-threaded access can cause mdsip to misbehave. Typical pattern is that it initially works, but as the client adds additional worker threads accessing the remote server, then the mdsip protocol eventually fails and throws exceptions. Conjecture is that typing Python statements one at a time in different Spyder IDE "cells" involves threading (or some subtle change in context) that is not compatible with the mdsip protocol.
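One way to probe this conjecture is to record which process and thread execute each statement, then compare the values across cells. The following is a minimal sketch using only the Python standard library; the helper names `execution_context` and `same_thread` are hypothetical and not part of MDSplus.

```python
import os
import threading


def execution_context():
    """Return (process id, thread id, thread name) for the caller."""
    thread = threading.current_thread()
    return os.getpid(), threading.get_ident(), thread.name


def same_thread(contexts):
    """Return True if all recorded contexts share one process and one thread."""
    return len({(pid, tid) for pid, tid, _name in contexts}) == 1


# Intended use (assumption): call execution_context() in the same cell as each
# MDSplus statement, i.e. Tree(...), getNode(...), getData(), and collect the
# returned tuples in a list for comparison with same_thread().
```

If the recorded values differ between cells, that would support the conjecture; if they are identical, the cause probably lies elsewhere.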

Workaround
If the above conjecture turns out to be true, there are two workarounds:

  • Run the entire Python script in a single Spyder IDE "cell" (i.e., the approach you used in your previous post).
  • Skip the Spyder IDE and run the Python interpreter directly (i.e., python3 some_analysis.py).
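The second workaround can also be driven from inside an interactive session by handing the whole script to a fresh interpreter process. The following is a sketch under assumptions: `run_in_fresh_interpreter` and `READ_SIGNAL` are hypothetical names, and the payload simply repeats the reproduction steps from this thread so that they all run in one process and one thread.

```python
import subprocess
import sys


def run_in_fresh_interpreter(source, timeout=60):
    """Run Python source in a new interpreter process and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the child exits non-zero
    )
    return result.stdout


# Hypothetical payload: the statements from this issue, kept together in one script.
READ_SIGNAL = """
from MDSplus import Tree
tree = Tree('hplabon', 1934)
node = tree.getNode('.CONFIG.WAVEFORMS.OUTPUTS.WFRM00:SIG')
print(node.getData())
"""
```

Calling `run_in_fresh_interpreter(READ_SIGNAL)` from a cell would then return the printed data as text, provided MDSplus is importable in the child interpreter and the tree path is configured in its environment.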

@alkhwarizmi
Contributor Author

Hello,

That makes sense to me.

For the moment, the options at our site are to run the scripts in a non-interactive environment (no IPython) or to use the "run" command. We also have the option to upgrade the software from 7.50 to a more recent version. I think this is recommended, but the old version is not really the source of the problem.

In the long run, however, I also think that this is an opportunity for the MDS+ community to upgrade mdsip to support multi-threaded applications.

Sincerely,

@mwinkel-dev
Contributor

Hi @alkhwarizmi,

You are correct that mdsip should be upgraded to be thread safe (regarding operating system threads and CPython threads). The lack of thread safety has caused us some headaches in our research too. We are well aware of the problem; however, rewriting mdsip is a big project that will take a lot of time and testing.

You are also correct that upgrading your site's Windows server to a new version (instead of the existing stable_7.50.1) will not solve the threading issue you have encountered. Nonetheless, the stable_7.50.1 version is ~4 years old, and thus it would be good to start planning to upgrade MDSplus on your site's Windows server.

Even though you have identified the root cause of the problem, I will keep this issue open until I do the following:

  • run some remaining tests with 32-bit mdsip,
  • reproduce the problem in the Spyder shell and/or the IPython shell, and
  • create some related "feature request" issues on GitHub.

@zack-vii
Contributor

zack-vii commented Mar 25, 2024

Actually, mdsip does support a kind of multi-threaded access. The thing is that it had to be backwards compatible, and hence one could not simply create a connection per thread automatically. Nonetheless, this would still require server-side support for system-wide ranged file locking (like OFD locks), as mdsip may spawn processes to serve multiple connections.
That said, you need to explicitly open a thread-private connection. This is done in some of the Python device drivers using the Tree.copy() method. That did work back then and should still, although I cannot find a device example at the moment.

edit: according to the TestDevice's comment:

# make thread safe copy of device node with a non-global context

The copy is used to transform a global tdi context into a local context. The way it is used here would not create a thread private connection.
Anyway, mdstcpip/mdsipshr/mdsipthreadstatic.h still suggests that we may have thread private connections.
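As a generic illustration of the "one resource per thread" idea, here is a minimal sketch built on `threading.local` from the Python standard library. The class name `PerThread` is hypothetical, and the sketch only keeps one object per thread on the Python side. Whether each such object ends up with its own private mdsip connection depends on MDSplus internals and is not confirmed in this thread.

```python
import threading


class PerThread:
    """Lazily create one resource per thread using a caller-supplied factory."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def get(self):
        # Attributes set on a threading.local instance are visible only to
        # the thread that set them, so each thread builds its own resource.
        if not hasattr(self._local, "resource"):
            self._local.resource = self._factory()
        return self._local.resource


# Hypothetical usage (assumption, untested against MDSplus):
#   trees = PerThread(lambda: Tree('hplabon', 1934))
#   data = trees.get().getNode(nodePath).getData()
```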

@mwinkel-dev
Contributor

Hi @zack-vii -- Thanks for the additional details. I will follow up with Josh.

@mwinkel-dev
Contributor

Client = Ubuntu 20 with recent MDSplus; Server = same. A quick test with ipython3 was able to do node.getData() for a signal without triggering the error.

Will repeat this experiment on Windows.
