Elastic web crawler known issues
The Elastic web crawler has the following known issues:
-
The crawler does not crawl pure JavaScript single-page applications (SPAs).
We recommend looking at dynamic rendering to help the crawler properly index your JavaScript websites. Another option is to serve a static HTML version of your JavaScript website, using a solution such as Prerender.
-
The crawler does not support dynamic content.
The crawler does not execute JavaScript, and it only pulls text from HTML elements.
-
URLs may be indexed despite having duplicate content and a canonical URL setting.
Canonical URL link tags are embedded within the HTML source of pages that duplicate the content of other pages. Refer to Duplicate document handling for details. The crawler identifies duplicate content by hashing the content of default deduplication fields derived from the page. These fields are defined by the configuration setting connector.crawler.extraction.default_deduplication_fields. The web crawler checks your index for an existing document with the same content hash. Users have reported cases where a page with a canonical link tag is not treated as a duplicate because its content hash differs from that of the canonical page, even though the content appears identical on inspection.
Use the following workaround:
You can manage which fields the web crawler uses to create the content hash. If your pages all define canonical URLs, you could safely change your deduplication fields settings to include only the url field. Otherwise, you may need more fields to help check for duplicates. By default, the web crawler checks the body_content, headings, links, meta_description, meta_keywords, and title fields.
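To illustrate why two pages that look the same can still produce different hashes, here is a minimal Python sketch of field-based content hashing. It is not the crawler's actual implementation: the content_hash function, the SHA-256 and JSON serialization choices, and the sample documents are all assumptions made for illustration. Only the field names come from the defaults listed above. The sketch also assumes both documents carry the same value in their url field.

```python
import hashlib
import json

# Field names taken from the documented defaults; everything else is hypothetical.
DEFAULT_DEDUP_FIELDS = [
    "body_content", "headings", "links",
    "meta_description", "meta_keywords", "title",
]

def content_hash(doc, fields=DEFAULT_DEDUP_FIELDS):
    """Hash only the selected fields of a document (illustrative, not the crawler's algorithm)."""
    selected = {name: doc.get(name) for name in sorted(fields)}
    payload = json.dumps(selected, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Two hypothetical documents that look identical to a reader,
# but differ by one trailing space in body_content.
page_a = {
    "url": "https://example.com/products",
    "title": "Products",
    "body_content": "Hello world",
}
page_b = {
    "url": "https://example.com/products",
    "title": "Products",
    "body_content": "Hello world ",
}

# With the default fields, the invisible difference changes the hash.
print(content_hash(page_a) == content_hash(page_b))  # False

# Hashing only the url field treats the two documents as duplicates.
print(content_hash(page_a, ["url"]) == content_hash(page_b, ["url"]))  # True
```

Under these assumptions, narrowing the deduplication fields to url makes the comparison independent of small differences in extracted content, which is the trade-off the workaround describes.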