Feature layer query introduces duplicates when querying above 2000 records #1870

HDO-B · 2024-07-17T13:08:23Z

Since the query refactor in 2.3.0, the query method does not correctly request all features when the number of features exceeds the maxRecord limit of the feature service.

A query that should return 18k features returns 20k features, including ~3k duplicates. The same query on 2.2.x and earlier has no issues.

The issue was found when querying a feature service with a maxRecord set at 5000, with a query returning more than 15k features.

I found some discrepancies between the old and new code.

Default maxRecord count instead of service property
When the query exceeds the transfer limit, the code sets a resultRecordCount of 2000 and a resultOffset of the same amount.

if "resultRecordCount" not in params:
    # assign initial value after first query
    params["resultRecordCount"] = 2000
if "resultOffset" in params:
    # add the number we found to the offset so we don't have doubles
    params["resultOffset"] = params["resultOffset"] + len(
        result["features"]
    )
else:
    # initial offset after first query (result record count set by user or up above)
    params["resultOffset"] = params["resultRecordCount"]

This may be a default value, but when the default return amount on the feature service is higher, the second query is wrong: it requests 2000 records at an offset of 2000, even though the first query had already returned 5000 features.
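To make the mismatch concrete, here is a minimal simulation of the offset logic quoted above. The fake_service_query function is a hypothetical in-memory stand-in for the feature service (18,000 features, maxRecord of 5000, results always in OBJECTID order), not the real API, so treat it as a sketch of the arithmetic only.

```python
def fake_service_query(params, total=18_000, max_record_count=5_000):
    """Hypothetical stand-in for the feature service; features are just OBJECTIDs."""
    offset = params.get("resultOffset", 0)
    limit = min(params.get("resultRecordCount", max_record_count), max_record_count)
    ids = list(range(offset, min(offset + limit, total)))
    return {"features": ids, "exceededTransferLimit": offset + limit < total}


def buggy_paging(query):
    """Mirrors the offset logic quoted above, as described in this issue."""
    params = {}
    result = query(params)
    features = result["features"]
    while result.get("exceededTransferLimit"):
        if "resultRecordCount" not in params:
            params["resultRecordCount"] = 2000
        if "resultOffset" in params:
            params["resultOffset"] += len(result["features"])
        else:
            # Starts at 2000 even though the first page already held 5000 features.
            params["resultOffset"] = params["resultRecordCount"]
        result = query(params)
        features = features + result["features"]
    return features


features = buggy_paging(fake_service_query)
duplicates = len(features) - len(set(features))
print(len(features), duplicates)
```

In this simulation the features with OBJECTID 2000 through 4999 are fetched twice, once in the first page and again in the follow-up pages.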

First query may or may not be ordered.
The second problem is that the features returned by the first query (at the top of the query method) are not ordered, while the follow-up queries that use resultRecordCount and resultOffset are ordered. The follow-up results may therefore contain features that were already returned by the very first query. Before the refactor this wasn't an issue, because the code checked whether pagination was needed before performing the first query.
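The ordering problem can be illustrated in isolation with a small simulation. Everything here is hypothetical: fake_post stands in for the service, and it assumes a worst case in which a request without ordering or paging parameters comes back in reverse OBJECTID order. The paging loop uses a correct offset, so any overlap comes from ordering alone.

```python
TOTAL = 12_000
MAX_RECORD_COUNT = 5_000


def fake_post(params):
    """Hypothetical service. Assumption: a request with neither ordering nor
    paging parameters comes back in a different order (here: reversed)."""
    ids = list(range(TOTAL))
    if "orderByFields" not in params and "resultOffset" not in params:
        ids.reverse()
    offset = params.get("resultOffset", 0)
    limit = min(params.get("resultRecordCount", MAX_RECORD_COUNT), MAX_RECORD_COUNT)
    return {
        "features": ids[offset:offset + limit],
        "exceededTransferLimit": offset + limit < TOTAL,
    }


def fetch_all(params):
    """Paging loop with a correct offset, so only ordering can cause overlap."""
    params = dict(params)
    result = fake_post(params)
    features = list(result["features"])
    while result["exceededTransferLimit"]:
        params["resultRecordCount"] = MAX_RECORD_COUNT
        params["resultOffset"] = len(features)
        result = fake_post(params)
        features += result["features"]
    return features


unordered = fetch_all({"where": "1=1"})
ordered = fetch_all({"where": "1=1", "orderByFields": "OBJECTID ASC"})
print(len(unordered), len(set(unordered)))
print(len(ordered), len(set(ordered)))
```

Under that assumption the unordered run returns the right number of rows but with repeats and gaps, while the run with an explicit order returns each feature exactly once.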

def _query(layer, url, params, raw=False):
    """returns results of query"""
    result = {}
    try:
        # Layer query call
        result = layer._con.post(url, params, token=layer._token)  # This one is not ordered?

        # Figure out what to return
        if "error" in result:
            raise ValueError(result)
        elif "returnCountOnly" in params and _is_true(params["returnCountOnly"]):
            # returns an int
            return result["count"]
        elif "returnIdsOnly" in params and _is_true(params["returnIdsOnly"]):
            # returns a dict with keys: 'objectIdFieldName' and 'objectIds'
            return result
        elif "returnExtentOnly" in params and _is_true(params["returnExtentOnly"]):
            # returns extent dictionary with key: 'extent'
            return result
        elif _is_true(raw):
            return result
        elif "resultRecordCount" in params and params["resultRecordCount"] == len(
            result["features"]
        ):
            return arcgis_features.FeatureSet.from_dict(result)
        else:
            # we have features to return
            features = result["features"]

        # If none of the ifs above worked then keep going to find more features
        # Make sure we have all features
        if "exceededTransferLimit" in result:
            while (
                "exceededTransferLimit" in result
                and result["exceededTransferLimit"] == True
            ):
                if "resultRecordCount" not in params:
                    # assign initial value after first query
                    params["resultRecordCount"] = 2000
                if "resultOffset" in params:
                    # add the number we found to the offset so we don't have doubles
                    params["resultOffset"] = params["resultOffset"] + len(
                        result["features"]
                    )
                else:
                    # initial offset after first query (result record count set by user or up above)
                    params["resultOffset"] = params["resultRecordCount"]

                result = layer._con.post(path=url, postdata=params, token=layer._token)  # These queries are ordered?
                # add new features to the list
                features = features + result["features"]
        # assign complete list
        result["features"] = features

I work around these issues as follows:

Forcing an ordering on every query, so the first query is ordered as well. Changing the code to check whether pagination is needed before performing the feature queries (as it did before 2.3.0) would also fix this.
order_by_fields="OBJECTID ASC"
To request the correct number of features, I replaced params["resultRecordCount"] = 2000 in the query method with params["resultRecordCount"] = len(features), so the number of features returned by the first query is used as the maxRecord amount. This value could also be read from the service properties, as before.
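As a rough sketch of the "check pagination before querying" alternative, the loop below asks for the count first and then pages from offset 0 with an explicit order and the service's own page size. This is not the pre-2.3.0 implementation; fake_post and query_all are hypothetical names, and the max_record_count argument is assumed to come from the service properties.

```python
def fake_post(params, total=30_000, max_record_count=5_000):
    """Hypothetical stand-in for the service call against a 30k-feature layer."""
    if params.get("returnCountOnly"):
        return {"count": total}
    offset = params.get("resultOffset", 0)
    limit = min(params.get("resultRecordCount", max_record_count), max_record_count)
    ids = list(range(offset, min(offset + limit, total)))
    return {"features": ids, "exceededTransferLimit": offset + limit < total}


def query_all(post, params, max_record_count):
    """Page from offset 0 with a fixed order and the service's own page size."""
    total = post({**params, "returnCountOnly": True})["count"]
    page_params = {**params, "resultRecordCount": max_record_count}
    # Keep a caller-supplied order; otherwise force one so all pages agree.
    page_params.setdefault("orderByFields", "OBJECTID ASC")
    features = []
    for offset in range(0, total, max_record_count):
        page = post({**page_params, "resultOffset": offset})
        features.extend(page["features"])
    return features


features = query_all(fake_post, {"where": "1=1"}, max_record_count=5_000)
print(len(features), len(set(features)))
```

Because every page, including the first, uses the same order and page size, no page can overlap with another in this sketch.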


nanaeaubry · 2024-07-31T07:31:04Z

Thanks for reporting, we will put in a fix!

nanaeaubry · 2024-07-31T08:05:53Z

@HDO-B Can you provide a sample feature layer where this is occurring? We have tested with several feature layers of size greater than 2000 and cannot reproduce.

Even using a Feature layer with over 100000 features:

HDO-B · 2024-09-02T15:50:02Z

Sure!

I have a sample feature layer that has 30k random points with a maxRecord set to 5000 features.
https://services.arcgis.com/WYfEIrzV4ySDv6rV/arcgis/rest/services/testdata/FeatureServer/0

When I query this layer using "1=1" it returns 33000 features.

The cause is that maxRecord is set to 5000 features while the code assumes a default of 2k. When the first query reports that the result exceeds the transfer limit, the code puts a limit of 2000 features on the next request in the params. The offset is also set to the same amount.

Since the feature service returns 5000 features on the first query, 3000 features are queried twice by the following queries, which use a limit of 2k and an offset of 2k.
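The 33000 figure follows from simple arithmetic, sketched below. The helper buggy_total is a hypothetical name, and it assumes the layer holds more features than the hard-coded page size of 2000.

```python
def buggy_total(total, max_record_count, default_page=2000):
    """Number of features the paging described above would return."""
    first_page = min(total, max_record_count)
    if first_page == total:
        # Everything fit in the first query, so no paging happens.
        return total
    # Follow-up pages start at offset `default_page` and run to the end,
    # so they cover total - default_page features on top of the first page.
    return first_page + (total - default_page)


returned = buggy_total(30_000, 5_000)
print(returned, returned - 30_000)
```

For the sample layer this gives 5000 + 28000 = 33000 returned features, of which 3000 are repeats.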

HDO-B added the bug label Jul 17, 2024

nanaeaubry self-assigned this Jul 31, 2024

nanaeaubry added the cannot reproduce label Jul 31, 2024

HDO-B mentioned this issue Sep 12, 2024

query ignores result_record_count (and skips some items) if larger than max return count #2027

Open

nanaeaubry removed the cannot reproduce label Sep 17, 2024
