Elastic web crawler known issues

The Elastic web crawler has the following known issues:

  • The crawler does not crawl pure JavaScript single-page applications (SPAs).

    We recommend looking at dynamic rendering to help the crawler properly index your JavaScript websites. Another option is to serve a static HTML version of your JavaScript website, using a solution such as Prerender.

  • The crawler does not support dynamic content.

    The crawler does not execute JavaScript, and it only pulls text from HTML elements.

  • The crawler does not support form-based authentication.

    The crawler currently supports only basic authentication and authentication headers (for example, bearer tokens).
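
    As a rough illustration of the two supported methods, the sketch below builds the corresponding Authorization header values with the Python standard library. The function names and credentials are hypothetical, and this is not the crawler's own code.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return an HTTP Basic Authorization header value (RFC 7617)."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def bearer_auth_header(token: str) -> str:
    """Return a Bearer Authorization header value (RFC 6750)."""
    return f"Bearer {token}"


# Hypothetical credentials, for illustration only.
print(basic_auth_header("Aladdin", "open sesame"))
print(bearer_auth_header("example-token"))
```

    A site that only accepts a login form submission cannot be reached with either header.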

  • URLs are indexed despite having duplicate content and a canonical URL setting.

    Canonical URL link tags are embedded within HTML source for pages that duplicate the content of other pages. Refer to Duplicate document handling for details. The crawler identifies duplicate content by hashing the content of default deduplication fields derived from the page. These fields are defined by the configuration setting connector.crawler.extraction.default_deduplication_fields.

    The web crawler checks your index for an existing document with the same content hash. Users have reported cases where a page with a canonical link tag is still indexed as a separate document because its content hash differs from the hash of the page it duplicates, even though the content looks identical on inspection.

    Use the following workaround:

    You can manage which fields the web crawler uses to create the content hash. If your pages all define canonical URLs, you could safely change your deduplication fields settings to include only the url field. Otherwise, you may need more fields to help check for duplicates. By default, the web crawler checks body_content, headings, links, meta_description, meta_keywords, and title fields.
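
    To show why two pages that look identical can still produce different hashes, here is a minimal sketch of field-based hashing. It assumes SHA-256 over the selected fields; the crawler's actual hashing algorithm is not described here and may differ. The function content_hash and the sample pages are hypothetical.

```python
import hashlib

# Default deduplication fields, as listed above.
DEFAULT_FIELDS = [
    "body_content", "headings", "links",
    "meta_description", "meta_keywords", "title",
]


def content_hash(page: dict, fields: list) -> str:
    """Hash only the selected fields of a page (illustrative algorithm)."""
    digest = hashlib.sha256()
    for name in sorted(fields):
        value = page.get(name, "")
        if isinstance(value, (list, tuple)):
            value = "\n".join(str(item) for item in value)
        # Separate field names and values so adjacent fields cannot blur together.
        digest.update(name.encode("utf-8") + b"\x00")
        digest.update(str(value).encode("utf-8") + b"\x00")
    return digest.hexdigest()


# Two hypothetical pages with the same visible text but different links.
page_a = {"title": "Pricing", "body_content": "Our plans",
          "links": ["https://example.com/signup"]}
page_b = {"title": "Pricing", "body_content": "Our plans",
          "links": ["https://example.com/signup?ref=footer"]}

print(content_hash(page_a, DEFAULT_FIELDS) == content_hash(page_b, DEFAULT_FIELDS))
print(content_hash(page_a, ["title", "body_content"]) ==
      content_hash(page_b, ["title", "body_content"]))
```

    In this sketch the pages differ only in the links field, so the hashes match once that field is left out of the selection. Narrowing the deduplication fields works on the same principle.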

  • Custom scheduling might break when upgrading from version 8.6 or earlier.

    If you encounter the error 'custom_schedule_triggered': undefined method 'each' for nil:NilClass (NoMethodError), it means the custom scheduling feature migration failed. You can use the following manual workaround:

    POST /.elastic-connectors/_update/<connector-id>
    {
      "doc": {
        "custom_scheduling": {}
      }
    }

    This error can appear on Connectors or Crawlers that aren’t the cause of the issue. If the error continues, try running the above command for every document in the .elastic-connectors index.
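
    If you need to run the workaround for many documents, a small helper can generate one request per document ID. The sketch below only builds the requests and does not send them. The function name and the IDs are hypothetical; the path and body mirror the manual workaround above.

```python
import json


def build_reset_requests(connector_ids):
    """Return (method, path, body) tuples that reset custom_scheduling."""
    body = json.dumps({"doc": {"custom_scheduling": {}}})
    return [
        ("POST", f"/.elastic-connectors/_update/{connector_id}", body)
        for connector_id in connector_ids
    ]


# Hypothetical document IDs taken from the .elastic-connectors index.
for method, path, body in build_reset_requests(["connector-a", "connector-b"]):
    print(method, path)
    print(body)
```

    Each printed request can then be sent with the HTTP client or console you already use against your deployment.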

  • The web crawler ignores uppercase noindex tags.

    Make sure these tags are lowercase.
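
    One way to find affected pages is to scan their HTML for robots meta tags whose noindex directive is not lowercase. This sketch assumes the directive appears in the content attribute of a meta tag named robots; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect robots meta content values with a non-lowercase noindex."""

    def __init__(self):
        super().__init__()
        self.flagged = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = (attr_map.get("name") or "").lower()
        content = attr_map.get("content") or ""
        has_noindex = "noindex" in content.lower()
        if name == "robots" and has_noindex and "noindex" not in content:
            self.flagged.append(content)


def find_non_lowercase_noindex(html: str) -> list:
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.flagged


print(find_non_lowercase_noindex('<meta name="robots" content="NOINDEX">'))
```

    Any value the function returns should be rewritten in lowercase in the page source.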

  • Updates to the default connector.crawler.http.user_agent are not applied.

    A workaround is to remove the connector prefix and update the crawler.http_agent setting in your Enterprise Search configuration file.

  • The web crawler uses a non-deterministic method to determine thread pool size, which can lead to unexpected behavior.

    To work around this, override the crawler.workers.pool_size.limit value in the elasticsearch.yml file.

  • Entry points should not have leading spaces.

    Whitespace is not stripped from entry points, so leading spaces are included in the URL and cause errors.
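
    If you generate entry points programmatically, you can strip whitespace before submitting them. A minimal sketch, with a hypothetical function name and example URLs:

```python
def clean_entry_points(entry_points):
    """Strip surrounding whitespace and drop entries that end up empty."""
    cleaned = []
    for url in entry_points:
        stripped = url.strip()
        if stripped:
            cleaned.append(stripped)
    return cleaned


print(clean_entry_points(["  https://example.com/", "https://example.com/blog ", "   "]))
```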

  • The crawler does not support UTF-16LE encoding.

    The crawler cannot read content encoded as UTF-16 little-endian. A workaround is to encode your files in a supported format such as UTF-8.
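
    Re-encoding can be scripted. The sketch below converts UTF-16 little-endian bytes to UTF-8 and drops a leading byte order mark if one is present. The function name is hypothetical, and the sketch assumes the input really is UTF-16LE.

```python
def utf16le_to_utf8(data: bytes) -> bytes:
    """Convert UTF-16 little-endian bytes to UTF-8 bytes."""
    text = data.decode("utf-16-le")
    # The utf-16-le codec keeps a byte order mark as U+FEFF, so remove it.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text.encode("utf-8")


sample = "\ufeffhéllo".encode("utf-16-le")
print(utf16le_to_utf8(sample))
```

    To convert a file, read it in binary mode, pass the bytes to the function, and write the result back in binary mode.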