Crawl web content

The Elastic Enterprise Search web crawler is a beta feature. Beta features are subject to change and are not covered by the support SLA of generally available (GA) features. Elastic plans to promote this feature to GA in a future release.

Complete the following steps to crawl your web content using the Enterprise Search web crawler.

1. Identify your web content and create engines.

2. For each engine, complete the first crawl cycle: manage, monitor, and troubleshoot your crawl.

3. Re-crawl your web content, and optionally schedule recurring crawls.

Identify web content

Before crawling your web content, you must inventory your domains and decide which you’d like to crawl and where you’d like to store the crawled documents. Consider an organization managing the following web content:

Content type                          URL
Website                               https://example.com
Blog                                  https://example.com/blog
Ecommerce application                 https://shop.example.com
Ecommerce administrative dashboard    https://shop.example.com/admin

This organization may decide to index their website and blog using the web crawler, while using the Documents API to index their ecommerce data.

Complete this exercise with your own content to determine which domains you’d like to crawl. If you haven’t already, read the Web crawler (beta) FAQ to evaluate the crawler’s capabilities and limitations.

After choosing domains to crawl, decide where you will store the resulting search documents. Consider another organization with the following web content.

Content type    URL
Website         https://example.com
Blog            https://blog.example.com

Although their website and blog are separate domains, they may choose to index them into a single engine.

Again, complete this exercise with your own content. Choose one engine per search experience. Each engine has its own crawl configuration, and is limited to a single active crawl.

Create engine

After reviewing Identify web content, create one or more new engines for your content. See Create an engine for an explanation of the process.

The following sections of this document describe a crawl cycle composed of the following steps: manage, monitor, troubleshoot. Repeat this cycle for each new engine.

Manage crawl

A crawl is the process by which the web crawler discovers, extracts, and indexes web content into an engine. See Crawl in the web crawler reference for a detailed explanation of a crawl.

Primarily, you manage each crawl in the App Search dashboard. There, you manage domains, entry points, and crawl rules, and you start and cancel the active crawl. However, you can also manage a crawl using files, such as robots.txt files and sitemaps, and you can embed crawler instructions within your content, such as canonical URL link tags, robots meta tags, and nofollow links. You can also start and cancel a crawl using the App Search API.

The following sections cover these topics.

Manage domains

A domain is a website or property you’d like to crawl. You must associate one or more domains with each crawl. See Domain in the web crawler reference for a detailed explanation of a domain.

Manage the domains for a crawl through the web crawler dashboard. From the engine menu, choose Web Crawler.

[image: web crawler navigation]

Add your first domain on the getting started screen.

[image: add first domain]

From there, you can view, add, manage, and delete domains using the web crawler dashboard.

[image: web crawler dashboard domains]

Manage entry points

Each domain must have one or more entry points. These are paths from which the crawler will start each crawl. See Entry point in the web crawler reference for a detailed explanation of an entry point.

Manage the entry points for a domain through the domain dashboard. From the engine menu, choose Web Crawler. Choose Manage next to the domain you’d like to manage. Then locate the Entry Points section of the dashboard.

[image: domains dashboard entry points]

From here, you can view, add, edit, and delete entry points.

The dashboard adds a default entry point of / to each domain. You can delete this entry point, but each domain must have at least one entry point.

Manage crawl rules

Each domain must also have one or more crawl rules. These rules instruct the crawler which pages to crawl within the domain. See Crawl rule in the web crawler reference for a detailed explanation of a crawl rule.

Manage the crawl rules for a domain through the domain dashboard. From the engine menu, choose Web Crawler. Choose Manage next to the domain you’d like to manage. Then locate the Crawl Rules section of the dashboard.

[image: domains dashboard crawl rules]

From here, you can view, add, edit, delete, and re-order crawl rules.

The dashboard adds a default crawl rule to allow all paths. You cannot delete this crawl rule, but you can insert more restrictive rules in front of this rule. See Crawl rule for explanations of crawl rule logic and the effects of crawl rule order.

Manage robots.txt files

Each domain may have a robots.txt file. This is a plain text file that provides instructions to web crawlers. The instructions within the file, also called directives, communicate which paths within that domain are disallowed (and allowed) for crawling. See Robots.txt file in the web crawler reference for a detailed explanation of a robots.txt file.

You can also use a robots.txt file to specify sitemaps for a domain. See Manage sitemaps.

Most web crawlers automatically fetch and parse the robots.txt file for each domain they crawl. If you already publish a robots.txt file for other web crawlers, be aware the Enterprise Search web crawler will fetch this file and honor the directives within it. You may want to add, remove, or update the robots.txt file for each of your domains.

Example: add a robots.txt file to a domain

To add a robots.txt file to the domain https://shop.example.com:

  1. Determine which paths within the domain you’d like to exclude.
  2. Create a robots.txt file with the appropriate directives from the Robots exclusion standard. For instance:

    User-agent: *
    Disallow: /cart
    Disallow: /login
    Disallow: /account
  3. Publish the file, with filename robots.txt, at the root of the domain: https://shop.example.com/robots.txt.

The next time the web crawler visits the domain, it will fetch and parse the robots.txt file.
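Before publishing, you can stage the file locally and sanity-check its directives. The following is a minimal sketch using the example rules above (the paths are the hypothetical shop.example.com rules); a common mistake it catches is a Disallow value that does not begin with /:

```shell
# Stage the robots.txt from the example above.
cat > robots.txt <<'EOF'
User-agent: *
Disallow: /cart
Disallow: /login
Disallow: /account
EOF

# Every Disallow value should begin with "/"; count any that do not.
bad=$(grep '^Disallow:' robots.txt | awk '{print $2}' | grep -cv '^/')
if [ "$bad" -eq 0 ]; then
  echo "robots.txt directives look ok"
else
  echo "$bad bad path(s) found"
fi
```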

Manage sitemaps

Each domain may have one or more sitemaps. These are XML files that inform web crawlers about pages within that domain. XML elements within these files identify specific URLs that are available for crawling. See Sitemap in the web crawler reference for a detailed explanation of a sitemap.

If you already publish sitemaps for other web crawlers, the Enterprise Search web crawler can use the same sitemaps. However, for the Enterprise Search web crawler to discover your sitemaps, you must specify them within robots.txt files.

You may want to add, remove, or update sitemaps for each of your domains.

Example: add a sitemap to a domain

To add a sitemap to the domain https://shop.example.com:

  1. Determine which pages within the domain you’d like to include. Ensure these paths are allowed by the domain’s crawl rules and the directives within the domain’s robots.txt file.
  2. Create a sitemap file with the appropriate elements from the sitemap standard. For instance:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://shop.example.com/products/1/</loc>
      </url>
      <url>
        <loc>https://shop.example.com/products/2/</loc>
      </url>
      <url>
        <loc>https://shop.example.com/products/3/</loc>
      </url>
    </urlset>
  3. Publish the file on your site, for example, at the root of the domain: https://shop.example.com/sitemap.xml.
  4. Create or modify the robots.txt file for the domain, located at https://shop.example.com/robots.txt. Anywhere within the file, add a Sitemap directive that provides the location of the sitemap. For instance:

    Sitemap: https://shop.example.com/sitemap.xml
  5. Publish the new or updated robots.txt file.

The next time the web crawler visits the domain, it will fetch and parse the robots.txt file and the sitemap.
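You can also check a sitemap before publishing. The following sketch stages the example sitemap above, confirms it parses as XML, and verifies that every `<loc>` URL stays on the (hypothetical) shop.example.com domain:

```shell
# Stage the sitemap from the example above.
cat > sitemap.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://shop.example.com/products/1/</loc></url>
  <url><loc>https://shop.example.com/products/2/</loc></url>
  <url><loc>https://shop.example.com/products/3/</loc></url>
</urlset>
EOF

# Parse the XML and check that all <loc> URLs are within the domain.
python3 - <<'PY'
import xml.etree.ElementTree as ET
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
locs = [el.text for el in ET.parse("sitemap.xml").getroot().findall("sm:url/sm:loc", ns)]
assert all(u.startswith("https://shop.example.com/") for u in locs), "URL outside domain"
print(f"{len(locs)} URL(s) ok")
PY
```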

Embed web crawler instructions within content

You can also embed instructions for the web crawler within your HTML content. These instructions are specific HTML tags, attributes, and values that affect the web crawler’s behavior.

The Enterprise Search web crawler recognizes the following embedded instructions, each of which is described further in the web crawler reference: canonical URL link tags, robots meta tags, and nofollow links.

Start crawl

Start a crawl from the web crawler or domain dashboard, or using the App Search API.

To use a dashboard, navigate to Web Crawler, then optionally choose a domain to manage. Choose the Start a Crawl button at the top of the dashboard.

[image: start crawl button default]

Each engine may have only one active crawl. The start button changes state to reflect a crawl is in progress.

[image: start crawl button crawling]

To start a crawl programmatically, refer to the following API reference:

Create a new crawl request
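As a sketch, the API call might look like the following. The deployment URL, engine name, and API key are placeholder values, and you should confirm the exact endpoint against the crawl requests API reference for your version:

```shell
# Placeholder values — replace with your deployment URL, engine name, and key.
BASE_URL="http://localhost:3002"
ENGINE="my-engine"
API_KEY="private-xxxxxxxxxxxx"

# Create a new crawl request, which starts a crawl for the engine.
curl -s -X POST "$BASE_URL/api/as/v1/engines/$ENGINE/crawler/crawl_requests" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  || echo "request failed (is Enterprise Search reachable at $BASE_URL?)"
```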

Cancel crawl

Cancel an active crawl from the web crawler or domain dashboard, or using the App Search API.

To use a dashboard, navigate to Web Crawler, then optionally choose a domain to manage. Expand the Crawling…​ button at the top of the dashboard. Choose Cancel Crawl.

[image: cancel crawl button]

To cancel a crawl programmatically, refer to the following API reference:

Cancel an active crawl
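A corresponding sketch for cancelling the active crawl follows. As above, the deployment URL, engine name, and key are placeholders, and the endpoint path should be verified against the API reference for your version:

```shell
# Placeholder values — replace with your deployment URL, engine name, and key.
BASE_URL="http://localhost:3002"
ENGINE="my-engine"
API_KEY="private-xxxxxxxxxxxx"

# Cancel the engine's active crawl request.
curl -s -X POST "$BASE_URL/api/as/v1/engines/$ENGINE/crawler/crawl_requests/active/cancel" \
  -H "Authorization: Bearer $API_KEY" \
  || echo "request failed (is Enterprise Search reachable at $BASE_URL?)"
```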

Monitor crawl

You can monitor a crawl while it is running or audit the crawl after it has completed.

Monitoring includes viewing the crawl status, crawl request ID, web crawler event logs (optionally filtered by the crawl ID and a specific URL), web crawler system logs, and documents indexed by the crawl.

The following sections cover these topics.

View crawl status

Each crawl has a status, which quickly communicates its state. See Crawl status in the web crawler reference for a description of each crawl status.

View the status of a crawl within the web crawler dashboard or using the App Search API.

To use the dashboard, navigate to Web Crawler and locate the Recent crawl requests section.

[image: domains dashboard recent crawl requests]

Refer to the Status column for the status of each recent crawl.

To get a crawl status programmatically, refer to the following API references:
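As a sketch, listing recent crawl requests (each of which should include its status) might look like the following, with placeholder deployment URL, engine name, and key; confirm the endpoint against the API reference for your version:

```shell
# Placeholder values — replace with your deployment URL, engine name, and key.
BASE_URL="http://localhost:3002"
ENGINE="my-engine"
API_KEY="private-xxxxxxxxxxxx"

# List recent crawl requests for the engine; check the "status" field of each.
curl -s "$BASE_URL/api/as/v1/engines/$ENGINE/crawler/crawl_requests" \
  -H "Authorization: Bearer $API_KEY" \
  || echo "request failed (is Enterprise Search reachable at $BASE_URL?)"
```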

View crawl request ID

Each crawl has an associated crawl request, which is identified by a unique ID in the following format: 60106315beae67d49a8e787d. Use a crawl request ID to filter the web crawler events logs to a specific crawl.

View the request ID of a crawl within the web crawler dashboard or using the App Search API.

To use the dashboard, navigate to Web Crawler and locate the Recent crawl requests section.

[image: domains dashboard recent crawl requests]

Refer to the Request ID column for the request ID of each recent crawl.

To get a crawl request ID programmatically, refer to the following API references:

View web crawler events logs

The Enterprise Search web crawler records detailed structured events logs for each crawl. The crawler indexes these logs into Elasticsearch, and you can view the logs using Kibana.

See View web crawler events logs for a step by step process to view the web crawler events logs in Kibana.

For a complete reference of all events, see Web crawler (beta) events logs reference.

View web crawler events by crawl ID and URL

To monitor a specific crawl or a specific domain, you must filter the web crawler events logs within Kibana.

To view the events for a specific crawl, first get the crawl’s request ID. Then filter within Kibana on the crawler.crawl.id field.

You can filter further to narrow your results to a specific URL. Use the following fields:

  • The full URL: url.full
  • Required components of the URL: url.scheme, url.domain, url.port, url.path
  • Optional components of the URL: url.query, url.fragment, url.username, url.password
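For example, in Kibana’s Discover view, a KQL filter that narrows the events logs to one crawl and one path might look like the following (the request ID uses the example format shown earlier, and the path is a placeholder):

```
crawler.crawl.id : "60106315beae67d49a8e787d" and url.path : "/blog"
```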

View web crawler system logs

If you are managing your own Enterprise Search deployment, you can also view the web crawler system logs.

View these logs on disk in the crawler.log file.

The events in these logs are less verbose than the web crawler events logs, but they can help solve web crawler issues. Each event has a crawl request ID, which allows you to analyze the logs for a specific crawl.

View indexed documents

The web crawler extracts the content from each web page, transforming it into a search document. It indexes these documents within the engine associated with the crawl. See Content extraction and indexing for more details on this process, and see Web crawler schema for more details on the structure of each search document.

View the indexed documents using the Documents or Query Tester views within the App Search dashboard, or use the search API.

To find a specific document, wrap the document’s URL in quotes, and use that as your search query. For example: "https://example.com/some/page.html". If the document is present in the engine, it should be a top result (or only result).

To access the documents dashboard, choose Documents from the engine menu.

[image: documents dashboard]

To access the query tester, choose Query Tester from the engine menu.

[image: query tester dashboard]

To use the search API, refer to the following API reference:

Search API
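As a sketch of the quoted-URL query described above, a search API call might look like the following. The deployment URL, engine name, and search key are placeholders; note the escaped quotes inside the JSON query value, which make the URL an exact-phrase query:

```shell
# Placeholder values — replace with your deployment URL, engine name, and key.
BASE_URL="http://localhost:3002"
ENGINE="my-engine"
SEARCH_KEY="search-xxxxxxxxxxxx"

# Find one document by wrapping its URL in (escaped) quotes.
curl -s -X POST "$BASE_URL/api/as/v1/engines/$ENGINE/search" \
  -H "Authorization: Bearer $SEARCH_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "\"https://example.com/some/page.html\""}' \
  || echo "request failed (is Enterprise Search reachable at $BASE_URL?)"
```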

Troubleshoot crawl

A crawl may not behave as expected or discover and index the documents you expected. The web crawler faces many challenges while it crawls, including:

  • Network issues: lost packets, timeouts, DNS issues
  • Resource contention: memory usage, CPU cycles
  • Parsing problems: broken HTML
  • HTTP protocol issues: broken HTTP servers, incorrect HTTP status codes

For a detailed look at crawl issues, see Sprinting to a crawl: Building an effective web crawler.

However, these issues generally fall into three categories: crawl stability, content discovery, and content extraction and indexing. Use the following sections to guide your troubleshooting:

See Troubleshoot crawl stability if:

  • You’re not sure where to start (resolve stability issues first)
  • The engine contains no documents
  • Many documents are missing or outdated
  • The crawl fails
  • The crawl runs for the maximum duration (default: 24 hours)

See Troubleshoot content discovery if:

  • Specific documents are missing or outdated

See Troubleshoot content extraction and indexing if:

  • Specific documents are missing or outdated
  • Documents contain incorrect content
  • Content is missing from documents

Troubleshoot crawl stability

Crawl stability issues prevent the crawler from discovering, extracting, and indexing your content. It is therefore critical that you address these issues first.

Use the following techniques to troubleshoot crawl stability issues.

Analyze web crawler events logs for the most recent crawl:

First:

  1. Find the crawl request ID for the most recent crawl.
  2. Filter the web crawler events logs by that ID.

Then:

  • Order the events by timestamp, oldest first.
  • Locate the crawl-end event and preceding events. These events communicate what happened before the crawl failed.

Analyze web crawler system logs:

These logs may contain additional information about your crawl.

See View web crawler system logs.

Modify the web crawler configuration:

As a last resort, operators can modify the web crawler configuration, including resource limits.

See Web crawler configuration settings in the web crawler reference.
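For instance, an operator might raise resource limits in the enterprise-search.yml configuration file. The setting names below appear elsewhere in this document; the values are purely illustrative, so check the configuration reference for supported ranges, units, and defaults:

```
# enterprise-search.yml — illustrative values only; consult the web crawler
# configuration reference before changing these on a real deployment.
crawler.http.response_size.limit: 10485760   # max HTTP response size the crawler will parse
crawler.http.redirects.limit: 10             # max redirects followed per URL
```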

Troubleshoot content discovery

After your crawls are stable, you may find the crawler is not discovering your content as expected. It’s helpful to understand how the web crawler discovers content. See Content discovery in the web crawler reference.

Use the following techniques to troubleshoot content discovery issues.

Confirm the most recent crawl completed successfully:

View the status of the most recent crawl to confirm it completed successfully. See View crawl status.

If the crawl failed, look for signs of crawl stability issues. See Troubleshoot crawl stability.

View indexed documents to confirm missing pages:

Identify which pages are missing from your engine, or focus on specific pages. See View indexed documents for instructions to view all documents and specific documents.

Analyze web crawler events logs for the most recent crawl:

First:

  1. Find the crawl request ID for the most recent crawl.
  2. Filter the web crawler events logs by that ID.
  3. Find the URL of a specific document missing from the engine.
  4. Filter the web crawler events logs by that URL.

Then:

  • Locate url-discover events to confirm the crawler has seen links to your page. The outcome and message fields may explain why the web crawler did not crawl the page.
  • If url-discover events indicate discovery was successful, locate url-fetch events to analyze the fetching phase of the crawl.

Analyze web crawler system logs:

These may contain additional information about specific pages.

See View web crawler system logs.

Address specific content discovery problems:

Each entry below names a problem, describes it, and provides a solution.

External domain

The web crawler does not follow links that go outside the domains configured for each crawl.

Manage domains for your crawl to add any missing domains.

Disallowed path

The web crawler does not follow links whose paths are disallowed by a domain’s crawl rules or robots.txt directives.

Manage crawl rules and robots.txt files for each domain to ensure paths are allowed.

No incoming links

The web crawler cannot find pages that have no incoming links, unless you provide the path as an entry point. See Content discovery for an explanation of how the web crawler discovers content.

Add links to the content from other content that the web crawler has already discovered, or explicitly add the URL as an entry point or within a sitemap.

Nofollow links

The web crawler does not follow nofollow links.

Remove the nofollow link to allow content discovery.

nofollow robots meta tag

If a page contains a nofollow robots meta tag, the web crawler will not follow links from that page.

Remove the meta tag from your page.

Page too large

The web crawler does not parse HTTP responses larger than crawler.http.response_size.limit.

Reduce the size of your page. Or, increase the limit for your deployment. Increasing the limit may increase crawl durations and resource consumption, and could reduce crawl stability.

Too many redirects

The web crawler does not follow redirect chains longer than crawler.http.redirects.limit.

Reduce the number of redirects for the page. Or, increase the limit for your deployment. Increasing the limit may increase crawl durations and resource consumption, and could reduce crawl stability.

Network latency

The web crawler fails requests that exceed the following network timeouts: crawler.http.connection_timeout, crawler.http.read_timeout, crawler.http.request_timeout.

Reduce network latency. Or, increase these timeouts for your deployment. Increasing the timeouts may increase crawl durations and resource consumption, and could reduce crawl stability.

HTTP errors

The web crawler cannot discover and index content if it cannot fetch HTML pages from a domain. The web crawler will not index pages that respond with a 4xx or 5xx response code.

Fix HTTP server errors. Ensure correct HTTP response codes.

HTML errors

The web crawler cannot parse extremely broken HTML pages. In that case, the web crawler cannot index the page, and cannot discover links coming from that page.

Use the W3C markup validation service to identify and resolve HTML errors in your content.

Security

The web crawler cannot access content requiring authentication or authorization.

Remove the security to allow access to the web crawler.

Non-HTML content

The web crawler does not extract and index non-HTML content (e.g. JavaScript, PDF).

Publish your content in HTML format.

Non-HTTP protocol

The web crawler recognizes only the HTTP and HTTPS protocols.

Publish your content at URLs using HTTP or HTTPS protocols.

Invalid SSL certificate

The web crawler will not crawl HTTPS pages with invalid certificates.

Replace invalid certificates with valid certificates.

Troubleshoot content extraction and indexing

The web crawler may be discovering your content but not extracting and indexing it as expected. It’s helpful to understand how the web crawler extracts and indexes content. See Content extraction and indexing in the web crawler reference.

Use the following techniques to troubleshoot content extraction and indexing issues.

Confirm the most recent crawl completed successfully:

View the status of the most recent crawl to confirm it completed successfully. See View crawl status.

If the crawl failed, look for signs of crawl stability issues. See Troubleshoot crawl stability.

View indexed documents to confirm missing pages:

Identify which pages are missing from your engine, or focus on specific pages. See View indexed documents for instructions to view all documents and specific documents.

If documents are missing from the engine, look for signs of content discovery issues. See Troubleshoot content discovery.

Analyze web crawler events logs for the most recent crawl:

First:

  1. Find the crawl request ID for the most recent crawl.
  2. Filter the web crawler events logs by that ID.
  3. Find the URL of a specific document missing from the engine.
  4. Filter the web crawler events logs by that URL.

Then:

  • Locate url-extracted events to confirm the crawler was able to extract content from your page. The outcome and message fields may explain why the web crawler could not extract and index the content.
  • If url-extracted events indicate extraction was successful, locate url-output events to confirm the web crawler attempted ingestion of the page’s content.

Analyze web crawler system logs:

These may contain additional information about specific pages.

See View web crawler system logs.

Address specific content extraction and indexing problems:

Each entry below names a problem, describes it, and provides a solution.

Duplicate content

If your website contains pages with duplicate content, those pages are stored as a single document within your engine. The document’s additional_urls field indicates the URLs that contain the same content.

Use a canonical URL link tag within any document containing duplicate content.

Non-HTML content

The web crawler does not extract and index non-HTML content (e.g. JavaScript, PDF).

Publish your content in HTML format.

noindex robots meta tag

The web crawler will not index pages that include a noindex robots meta tag.

Remove the meta tag from your page.

Page too large

The web crawler does not parse HTTP responses larger than crawler.http.response_size.limit.

Reduce the size of your page. Or, increase the limit for your deployment. Increasing the limit may increase crawl durations and resource consumption, and could reduce crawl stability.

Truncated fields

The web crawler truncates some fields before indexing the document, according to the following limits: crawler.extraction.body_size.limit, crawler.extraction.description_size.limit, crawler.extraction.headings_count.limit, crawler.extraction.indexed_links_count.limit, crawler.extraction.keywords_size.limit, crawler.extraction.title_size.limit.

Reduce the length of these fields within your content. Or, increase these limits for your deployment. Increasing the limits may increase crawl durations and resource consumption, and could reduce crawl stability.

Broken HTML

The web crawler cannot parse extremely broken HTML pages.

Use the W3C markup validation service to identify and resolve HTML errors in your content.

Provide feedback

After troubleshooting your crawl, we’d love to know what worked, what didn’t, and what we can improve.

Please send us your feedback.

Re-crawl web content

For each engine, repeat the manage-monitor-troubleshoot cycle until the web crawler is discovering and indexing your documents as expected.

From there, you move into the next cycle: update your web content, re-crawl your web content, (repeat).

At this point, you may want to move beyond manual crawls and schedule crawls instead.

Schedule crawls

You may want to trigger crawls programmatically. Use this technique to crawl in response to an event, such as pushing updated web content. Or, crawl according to a schedule.

To trigger crawls programmatically, refer to the following API reference:

Create a new crawl request

Schedule your API calls using a job scheduler, like cron. Or write your own application code to manage crawls.
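For example, a crontab entry could start a nightly crawl by calling the crawl requests API. The deployment URL, engine name, and key below are placeholders; note that cron does not inherit variables from your login shell, so define API_KEY in the crontab itself or inline in the command:

```
# Hypothetical crontab entry: start a crawl for "my-engine" every day at 02:00.
API_KEY=private-xxxxxxxxxxxx
0 2 * * * curl -s -X POST "http://localhost:3002/api/as/v1/engines/my-engine/crawler/crawl_requests" -H "Authorization: Bearer $API_KEY"
```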