
Drop support for optional data values starting with "x_" #318

Open

jayjacobs opened this issue May 25, 2024 · 6 comments

@jayjacobs

Having optional fields in structured data adds complexity for parsing at scale, and these fields are not being used in any practical way right now.

There are two fields in use that are generated by the upconvert process; they exist either for legacy reasons or specifically for the CNA, not for the data or any end users:

  • containers.cna.x_ConverterErrors
  • containers.cna.x_legacyV4Record

There is one field added by what I assume are CVE JSON generators, and we should consider adding it as a standard field in the schema:

  • containers.cna.x_generator

And then there are three fields that may be isolated to redhat:

  • containers.cna.x_redHatCweChain (2 CVEs)
  • containers.cna.x_redhatCweChain (444 CVEs)
  • containers.cna.affected.x_redhatStatus (2 CVEs)

There are no other uses of any optional "x_" instances.

@mprpic
Contributor

mprpic commented May 29, 2024

  • containers.cna.x_redHatCweChain (2 CVEs)

These have been cleaned up.

  • containers.cna.x_redhatCweChain (444 CVEs)

We use this for all CVEs where we have more fine-grained CWE data, so I do find it useful. I would also argue there are other cases where some data can be included in x_ attributes first, just to assess its viability, and then moved to an official field with the previous usages cleaned up. I thought that was the whole point of x_ attributes.

  • containers.cna.affected.x_redhatStatus (2 CVEs)

These are no longer used because x_ attributes are only allowed at the root level. The two records that still had them were also cleaned up.

@jayjacobs
Author

Awesome, thanks for the response @mprpic and appreciate the cleanup.

The request to drop "x_*" fields is not based on historical precedent at all. There may have been decisions made and discussions about the value and purpose of these optional fields. I'm just pointing out that in practice there is now one custom field from one CNA (plus the x_generator field), and the cost of writing an exception into parsers may not be worth the global benefit of allowing anyone to put anything into structured data.

I am speaking from my experience parsing and using this data. Having optional fields that can be applied inconsistently across records opens parsers up to potential errors and even failure modes. While I have been forced to create an exception in my code for this one source of JSON data, it would be nice (higher-quality, cleaner, more standardized data) if every JSON parser written did not have to carry an exception that regex-checks field names for this data.

@mprpic
Contributor

mprpic commented May 29, 2024

x_generator could indeed be made into a proper object similar to how the CSAF standard does it: https://docs.oasis-open.org/csaf/csaf/v2.0/os/csaf-v2.0-os.html#321123-document-property---tracking---generator
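For illustration, a first-class generator object modeled on CSAF's engine/date structure might look something like the sketch below (the field names mirror CSAF and are illustrative only, not a schema proposal):

```json
{
  "containers": {
    "cna": {
      "generator": {
        "engine": {
          "name": "example-cve-tool",
          "version": "1.2.3"
        },
        "date": "2024-05-29T12:00:00Z"
      }
    }
  }
}
```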

Our CWE data could also be moved into a specific note of some sort or somewhere else, so I guess I'd be fine with dropping the x_ attributes.

Though, just out of curiosity, how are you parsing this data such that these extra fields would cause issues? In my mind, whatever you get from the record that validates against the schema is correct, and you get to choose whichever fields you want to make use of. You can completely ignore any attributes that start with x_ and not handle them at all.

@jayjacobs
Author

I am trying to streamline and standardize all of my JSON intake, and I'm building out multiple layers/steps for each data-ingestion task. That means I have a single JSON-parsing layer that infers the data schema from the JSON itself (since many JSON sources do not create schemas and/or maintain them properly). The challenge is that JSON stores everything as a single stand-alone record, while for data analysis and knowledge extraction I need the data in a columnar format, which means extracting structured, normalized data and storing it in parquet files. If one CVE record adds an "x_foobar" field, the parser has to add an "x_foobar" column for every CVE record and leave it empty for all but that one (see the sketch below). Moving to a columnar format is very common for data analysis.
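As a minimal illustration of that column explosion (made-up records, plain pandas rather than my actual pipeline):

```python
import pandas as pd

# Two made-up, minimal CVE-like records; only the first carries an x_ field.
records = [
    {"cveMetadata": {"cveId": "CVE-2024-0001"}, "x_foobar": {"anything": "goes"}},
    {"cveMetadata": {"cveId": "CVE-2024-0002"}},
]

# Inferring a schema across the whole collection: the single x_foobar record
# forces an x_foobar.anything column onto every row, null for all but one.
df = pd.json_normalize(records)
print(df.columns.tolist())  # ['cveMetadata.cveId', 'x_foobar.anything']
df.to_parquet("cve.parquet")  # requires pyarrow or fastparquet
```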

The main challenge I have with the "x_*" fields is that any one of the few hundred CNAs can "legally" dump whatever structure they want, and I have to deal with it or code up an exception for this one particular source of data (which it seems I'll need to do for other data sources anyway). A great example of this is the decision to include a full copy of the v4 record in x_legacyV4Record: my first few passes on a generic JSON parser took twice as long to run because of it (embarrassingly, my first parser took over 2 hours on the full collection of CVE data; excluding x_legacyV4Record it was 1h20m, and now it's down to under 10 minutes). That x_legacyV4Record section added several dozen (unnecessary) parquet files and generated several data-consistency errors (since many v4 records fail schema validation).
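For reference, the exception I ended up carrying is roughly this kind of pre-processing step (a simplified sketch, not my actual code):

```python
import json

def drop_x_fields(obj):
    """Recursively drop any object keys that start with 'x_'."""
    if isinstance(obj, dict):
        return {k: drop_x_fields(v) for k, v in obj.items() if not k.startswith("x_")}
    if isinstance(obj, list):
        return [drop_x_fields(v) for v in obj]
    return obj

# Hypothetical file name, just for illustration.
with open("CVE-2024-0001.json") as f:
    record = drop_x_fields(json.load(f))
```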

After saying all of that, my local solution isn't really what this issue is about. I'll have to work exceptions for weird data into my process anyway, and this suggested change to the 5.2 schema isn't going to affect my own implementation or solve a problem for me personally, because I'm going to code around it either way.

Allowing optional data (potentially complex data structures) in a federated data-collection model presents a challenge to anyone trying to parse the data automatically. The fewer "surprises" in the data, the better. Combined with the fact that nobody is taking real advantage of these fields, this means the CVE program is supporting complexity with little (or no) reward. Hence I suggest we drop support for something that presents risk to consumers/parsers with very little benefit.

@mprpic
Contributor

mprpic commented May 31, 2024

+1, your reasoning makes sense!

Will you be sharing the final parquet files somewhere btw? It would be neat to define some common queries on top of those files that could be run through DuckDB à la https://www.nikolasgoebel.com/2024/05/28/duckdb-doesnt-need-data.html :-)
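E.g., something like this via DuckDB's Python API (the parquet file name here is hypothetical; querying https URLs needs the httpfs extension, which recent DuckDB builds load automatically):

```python
import duckdb

# Hypothetical file name; the repo's actual parquet layout may differ.
url = ("https://raw.githubusercontent.com/jayjacobs/visualcve/"
       "main/cache/cveparse/cve.parquet")
duckdb.sql(f"SELECT COUNT(*) AS cves FROM read_parquet('{url}')").show()
```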

@jayjacobs
Author

My first pass at parsing CVE JSON into parquet files was in R and is rather slow, but those parquet files are in this repo: https://github.com/jayjacobs/visualcve/tree/main/cache/cveparse. The latest pass is in Python and is a lot more robust (and much faster). That repo is the backend for the GitHub Pages site: https://jayjacobs.github.io/visualcve/

I had intended to refresh that repo fairly often, but so far that's maybe weekly if I'm lucky. :)
