Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some calling routines do not properly handle SsINTERNAL, C_ERROR, and/or FALSE returned by low-level routines #2741

Open
mwinkel-dev opened this issue Apr 14, 2024 · 4 comments
Assignees
Labels
bug An unexpected problem or unintended behavior

Comments

@mwinkel-dev
Copy link
Contributor

mwinkel-dev commented Apr 14, 2024

Affiliation
MIT PSFC

Version(s) Affected
all

Platform
all

Describe the bug
The entire MDSplus code base uses different standards for error codes.

  • on Linux, calls to the operating system functions usually return -1 to indicate an error (i.e., access errno for the code)
  • some MDSplus functions can return the macros C_OK (=0) and C_ERROR (= -1)
  • others return TRUE (=1) or FALSE(=0)
  • some return B_TRUE (=1) or B_FALSE (=0)
  • and several return SsINTERNAL (= -1)
  • plus many return MDSplus codes, such as MDSplusSUCCESS and MDSplusERROR (there are many more too)

Often the status returned by a function is immediately checked with one of the following "define" macros.

  • IS_OK
  • STATUS_OK
  • IS_NOT_OK
  • STATUS_NOT_OK

However, the above macros only work with MDSplus codes, which consist of a 32-bit integer containing 3 fields (facility ID, message ID, and severity code). Note that the low-order bit is important. Success = 1 in the bit, Failure = 0 in the bit.

The problems arise when non-MDSplus codes are passed to the macros. Specifically, the -1 of SsINTERNAL, C_ERROR or an operating system error status. The value -1 = 0xFFFFFFFF has the low-order bit set, thus the macros will incorrectly treat -1 as a flavor of success. Which can cause error handling code to produce different results than what was intended when the code was originally written.

This mix of error code conventions is a systemic source of errors in the entire code base. It would probably be worthwhile to at least make sure that all routines that return SsINTERNAL or C_ERROR, are in turn called by functions that properly detect those return values and then map them into the appropriate MDSplus error codes.

To Reproduce
This is an example that was spotted during the investigation of Issue #2731 and PR #2740. The low-level routine, send_bytes() can return SsINTERNAL. And potentially, it can ripple up the call stack all the way to the top level of mdstcl.

The following links are to lines of code starting in send_bytes() and then climbing up the call stack.
send_bytes()
SendMdsMsgC()
SendArg()
- bypasses error handling
GetAnswerInfoTS() can also return SsINTERNAL
- bypasses error handling
ServerSendMessage()
- bypasses error handling
ServerDispatchAction()
ServerDispatchPhase()
- bypasses error handling
TclDispatch_phase()
- bypasses error handling

Because the macros treat SsINTERNAL as success, the error bypasses several error handling sections in the call chain. (The analysis of GetAnswerInfoTS is not shown above, but is similar to that for SendArg.)

Expected behavior
Any function that can return SsINTERNAL, C_ERROR or other non-MDSplus codes, should be followed by a status check that also includes those values. In the above example, the following changes should be made.

  • if (STATUS_OK) becomes if ((status != SsINTERNAL) && STATUS_OK)
  • if (STATUS_NOT_OK) becomes if (status == SsINTERNAL) || STATUS_NOT_OK)

The above example is the fifth or so instance of this bug that I have seen in the MDSplus source code. There are undoubtedly more bugs like this lurking in the source code. However, they are likely in seldom exercised error handling code sections. (Bugs like this that affected the mainstream workflow would have been reported by customers years ago and are surely already fixed.)

Screenshots
n/a

Additional context
There are three strategies for dealing with this issue.

  1. Don't Propagate -- Don't allow non-MDSplus error codes to ripple up the entire call chain. Limit it to just the first calling routine. Find all instances of SsINTERNAL and C_ERROR in the source code. Create a list of all the associated functions. Then create a second list of all routines that call the functions on the first list. Make sure that all functions on the second list correctly process SsINTERNAL and C_ERROR and only return MDSplus error codes. This is the safest approach, but also the most time consuming to implement.
  2. Redefine Macros -- Change the definition of the macros (IS_OK, IS_NOT_OK, etc) to also check for SsINTERNAL and C_ERROR. A simple and easy change, but is riskier because it is a global change.
  3. Redefine SsINTERNAL -- Redefine SsINTERNAL to be -2 (i.e. 0xFFFFFFFE) so that if it is inadvertently passed into the macros, it will be treated as a failure and not as a success. (This is the same suggestion made in the comments on PR 2740, Fix: reduce open files due to dispatcher #2740 (comment) .) Also simple and easy, but risky because it is a global change.

The "Don't Propagate" strategy should be feasible because the number of functions is fairly small.

  • SsINTERNAL occurs 36 times, but in just 13 files
  • C_OK is 23 times, 14 files
  • C_ERROR is 67 times, 14 files
  • B_TRUE is 20 times, 12 files
  • B_FALSE is 36 times, 13 files
  • FALSE and TRUE are too numerous to inspect each occurrence
@mwinkel-dev mwinkel-dev added the bug An unexpected problem or unintended behavior label Apr 14, 2024
@mwinkel-dev
Copy link
Contributor Author

If the goal of SsINTERNAL is to alert the user to an extremely abnormal condition that should never arise during normal operation of MDSplus, then another approach is to throw an exception immediately. Instead of functions using return SsINTERNAL, replace those lines with a macro that displays an error message and then terminates the program.

The issue with the current code is that it masks errors by treating SsINTERNAL as a flavor of success. The current code "swallows" the SsINTERNAL and hides it from the user. The user is thus unaware that an abnormal condition occurred.

@zack-vii
Copy link
Contributor

SsInternal is used to flag a situation that requires internal handling. i think it is used in remoteacces of treeshr to indicate a remote operation failed because of x is not supported so the code will try method y... something like that.

@zack-vii
Copy link
Contributor

that said i agree the ok, error, true, and false should bbe consolidated tto follow one schema. but earlier attempts failed is it was either unclear if the methods are used by users (so we cannot chamge behaviour) and it was not always clear if 0 means success or false. we ended up replacing 0, 1 and -1 with the names you listed in order to help with that. however, noone dared to switch true/false to success/error or vice versa because we were afrait we would break code. today we have a better separation of public and private headers. so its worth checking if former concerns still hold

@mwinkel-dev
Copy link
Contributor Author

mwinkel-dev commented Apr 16, 2024

Hi @zack-vii -- Thanks for the history.

If you would prefer to discuss this via a Zoom meeting, let me know and I'll arrange to be available at whatever time is convenient for you. (There are times when a complex topic is easier to handle with a discussion than with the written word.)

Reverting to the written word, I will try again to communicate the issue.

In my opinion, there is a difference between what we expect SsINTERNAL to do and what the code actually does.

There are three separate issues here.

  • I concur that there are risks in consolidating the error schemas. And I am hesitant to recommend such consolidation, given the potential impact on the customers, plus the current level of regression testing.
  • Please review all of the links to the lines of code given in the bug report. Which portions of that code do you recommend be changed? In this example, the lowest-level routine can generate SsINTERNAL and none of the routines in the entire call stack check or handle the SsINTERNAL. (More accurately stated, they do check the status but ignore the SsINTERNAL because they view it as "success".) Shouldn't that instead be considered an error?
  • Throughout the MDSplus source code, there are other instances similar to the above example. What is the "best practice" for fixing those code sections?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

2 participants