Feature layer query introduces duplicates when querying above 2000 records #1870

Open
HDO-B opened this issue Jul 17, 2024 · 3 comments

HDO-B commented Jul 17, 2024

Since the query refactor in 2.3.0, the query method does not correctly request all features when the number of features exceeds the maxRecord limit of the feature service.

A query that should return 18k features returns 20k features, including about 3k duplicates. The same query on 2.2.x and earlier has no issues.

The issue was found when querying a feature service with a maxRecord set at 5000, with a query returning more than 15k features.

I found some discrepancies between the old and new code.

  1. Default maxRecord count instead of service property
    When the query exceeds the transfer limit, the code sets a resultRecordCount of 2000 and a resultOffset of the same amount.
if "resultRecordCount" not in params:
    # assign initial value after first query
    params["resultRecordCount"] = 2000
if "resultOffset" in params:
    # add the number we found to the offset so we don't have doubles
    params["resultOffset"] = params["resultOffset"] + len(
        result["features"]
    )
else:
    # initial offset after first query (result record count set by user or up above)
    params["resultOffset"] = params["resultRecordCount"]

2000 may be a reasonable default, but when the feature service's own maximum record count is higher, the second query is wrong: it requests 2000 records at an offset of 2000, even though the first query has already returned 5000 features.

  2. First query may or may not be ordered.
    The second problem is that the features returned by the first query (at the top of the query method) are not ordered, whereas the follow-up queries that use resultRecordCount and resultOffset are ordered. As a result, those later pages may contain features that were already returned by the very first query. Before the refactor this wasn't an issue, because the code checked whether pagination was needed before performing the first query.
def _query(layer, url, params, raw=False):
    """returns results of query"""
    result = {}
    try:
        # Layer query call
        result = layer._con.post(url, params, token=layer._token)  # This one is not ordered?

        # Figure out what to return
        if "error" in result:
            raise ValueError(result)
        elif "returnCountOnly" in params and _is_true(params["returnCountOnly"]):
            # returns an int
            return result["count"]
        elif "returnIdsOnly" in params and _is_true(params["returnIdsOnly"]):
            # returns a dict with keys: 'objectIdFieldName' and 'objectIds'
            return result
        elif "returnExtentOnly" in params and _is_true(params["returnExtentOnly"]):
            # returns extent dictionary with key: 'extent'
            return result
        elif _is_true(raw):
            return result
        elif "resultRecordCount" in params and params["resultRecordCount"] == len(
            result["features"]
        ):
            return arcgis_features.FeatureSet.from_dict(result)
        else:
            # we have features to return
            features = result["features"]

        # If none of the ifs above worked then keep going to find more features
        # Make sure we have all features
        if "exceededTransferLimit" in result:
            while (
                "exceededTransferLimit" in result
                and result["exceededTransferLimit"] == True
            ):
                if "resultRecordCount" not in params:
                    # assign initial value after first query
                    params["resultRecordCount"] = 2000
                if "resultOffset" in params:
                    # add the number we found to the offset so we don't have doubles
                    params["resultOffset"] = params["resultOffset"] + len(
                        result["features"]
                    )
                else:
                    # initial offset after first query (result record count set by user or up above)
                    params["resultOffset"] = params["resultRecordCount"]

                result = layer._con.post(path=url, postdata=params, token=layer._token)  # These queries are ordered?
                # add new features to the list
                features = features + result["features"]
        # assign complete list
        result["features"] = features
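
To make the ordering concern concrete, here is a small self-contained sketch. It does not call the library or a real service: fake_query, STORAGE_ORDER and MAX_RECORD are hypothetical names, and the behavior "unpaginated requests come back in storage order, paginated requests come back sorted by object ID" is an assumption taken from the description above, not confirmed service behavior.

```python
# Hypothetical storage order of ten object IDs inside a made-up service.
STORAGE_ORDER = [7, 2, 9, 0, 4, 1, 8, 3, 6, 5]
MAX_RECORD = 5  # hypothetical service-side page limit


def fake_query(offset=None, count=None, order_by_oid=False):
    """Stand-in for one request; returns (object_ids, exceeded_transfer_limit).

    Assumption: a request with an offset is answered in object-ID order,
    a request without one is answered in storage order unless ordering is forced.
    """
    paginated = offset is not None
    rows = sorted(STORAGE_ORDER) if (paginated or order_by_oid) else STORAGE_ORDER
    start = offset or 0
    size = MAX_RECORD if count is None else min(count, MAX_RECORD)
    return rows[start:start + size], start + size < len(rows)


# Unordered first page followed by an ordered second page.
first, exceeded = fake_query()
second, _ = fake_query(offset=len(first), count=len(first))
combined = first + second
duplicates = sorted(oid for oid in set(combined) if combined.count(oid) > 1)
missing = sorted(set(STORAGE_ORDER) - set(combined))
print(duplicates, missing)

# Forcing the same ordering on the first page removes the overlap.
first_ordered, _ = fake_query(order_by_oid=True)
fixed = first_ordered + second
print(sorted(fixed) == sorted(STORAGE_ORDER))
```

Under these assumptions the offset is correct, yet the combined result still contains repeated IDs and misses others, because the two pages were cut from differently ordered lists.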

I work around these issues as follows:

  • Forcing an ordering on every query, so that the first query is ordered as well. Changing the code to check whether pagination is needed before performing the feature queries (as it did before 2.3.0) would also fix this.
    order_by_fields="OBJECTID ASC"

  • Requesting the correct number of features per page: in the query method, params["resultRecordCount"] = 2000 is replaced by params["resultRecordCount"] = len(features), so that the number of features returned by the first query (the maxRecord limit it hit) is used as the page size. This value could also be read from the service properties, as before.
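
The second workaround can be sketched against an in-memory stand-in for the service. This is only an illustration under stated assumptions: fake_query and paginate_with_first_page_size are hypothetical helpers, not library functions, and the fake service always answers in object-ID order, which stands in for forcing order_by_fields="OBJECTID ASC".

```python
def fake_query(total, max_record, offset=0, count=None):
    """Hypothetical stand-in for one request against a service holding
    object IDs 0..total-1; returns (object_ids, exceeded_transfer_limit)."""
    # The service never returns more than its own maxRecord limit.
    size = max_record if count is None else min(count, max_record)
    end = min(offset + size, total)
    return list(range(offset, end)), end < total


def paginate_with_first_page_size(total, max_record):
    """Page through the fake service using the size of the first page,
    instead of a hard-coded 2000, as both page size and initial offset."""
    ids, exceeded = fake_query(total, max_record)
    features = list(ids)
    page_size = len(ids)  # what the service actually returned
    offset = page_size
    while exceeded and ids:  # stop if a page ever comes back empty
        ids, exceeded = fake_query(total, max_record, offset=offset, count=page_size)
        features += ids
        offset += len(ids)
    return features


features = paginate_with_first_page_size(total=30_000, max_record=5_000)
print(len(features), len(set(features)))
```

With a limit of 5000 and 30000 features, every object ID is fetched exactly once in this model, because each follow-up request starts where the previous page ended.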

HDO-B added the bug label Jul 17, 2024
nanaeaubry self-assigned this Jul 31, 2024
@nanaeaubry
Contributor

Thanks for reporting; we will put in a fix!

@nanaeaubry
Contributor

nanaeaubry commented Jul 31, 2024

@HDO-B Can you provide a sample feature layer where this is occurring? We have tested with several feature layers containing more than 2000 features and cannot reproduce the issue.

Even using a Feature layer with over 100000 features:
[two images]

nanaeaubry added the cannot reproduce label Jul 31, 2024
@HDO-B
Author

HDO-B commented Sep 2, 2024

Sure!

I have a sample feature layer that has 30k random points with a maxRecord set to 5000 features.
https://services.arcgis.com/WYfEIrzV4ySDv6rV/arcgis/rest/services/testdata/FeatureServer/0

When I query this layer using "1=1", it returns 33000 features.
image

The cause is that maxRecord is set to 5000 features while the code assumes a default of 2000. When the first query reports that the result exceeds the transfer limit, the code puts a limit of 2000 features on the next request in the params. The offset is set to the same amount.

Since the feature service returns 5000 features on the first query, 3000 features are fetched a second time by the following queries, which use a limit of 2000 and start at an offset of 2000.
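
The arithmetic can be checked without a real service. The sketch below copies the offset logic quoted earlier in this issue and runs it against an in-memory stand-in; fake_query and paginate_like_2_3_0 are hypothetical names, and the fake service simply holds object IDs 0 to total-1 in order.

```python
def fake_query(total, max_record, offset=None, count=None):
    """Hypothetical stand-in for one feature-service request.

    Returns (object_ids, exceeded_transfer_limit).
    """
    start = offset or 0
    # The service never returns more than its own maxRecord limit.
    size = max_record if count is None else min(count, max_record)
    end = min(start + size, total)
    return list(range(start, end)), end < total


def paginate_like_2_3_0(total, max_record):
    """Mirror the paging logic quoted in this issue (hard-coded 2000)."""
    params = {}
    ids, exceeded = fake_query(total, max_record)  # first, unpaginated query
    features = list(ids)
    while exceeded:
        if "resultRecordCount" not in params:
            params["resultRecordCount"] = 2000
        if "resultOffset" in params:
            params["resultOffset"] += len(ids)
        else:
            # Initial offset is 2000, although 5000 features were already returned.
            params["resultOffset"] = params["resultRecordCount"]
        ids, exceeded = fake_query(
            total,
            max_record,
            offset=params["resultOffset"],
            count=params["resultRecordCount"],
        )
        features += ids
    return features


features = paginate_like_2_3_0(total=30_000, max_record=5_000)
duplicates = len(features) - len(set(features))
print(len(features), duplicates)  # 33000 3000
```

In this model the first request returns IDs 0 to 4999 and the follow-up requests cover IDs 2000 to 29999, so IDs 2000 to 4999 appear twice, which matches the 33000 features reported for the sample layer.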
