Web crawler (beta) FAQ

The Elastic Enterprise Search web crawler is a beta feature. Beta features are subject to change and are not covered by the support SLA of generally available (GA) features. Elastic plans to promote this feature to GA in a future release.

View frequently asked questions about the Enterprise Search web crawler below.

See Web crawler (beta) reference for detailed technical information about the web crawler.

We also welcome your feedback.

What functionality is supported?

  • Crawling publicly accessible HTTP/HTTPS websites
  • Support for crawling multiple domains per engine
  • Robots meta tag support
  • Robots "nofollow" support

    Includes robots meta tags set to "nofollow" and links with rel="nofollow" attributes (see the first sketch after this list).

  • Basic content extraction

    The web crawler extracts content for a predefined, unconfigurable set of fields from each page it visits (see the example document after this list).

  • "Entry points"

    Entry points allow customers to specify where the web crawler begins crawling each domain.

  • "Crawl rules"

    Crawl rules allow customers to control whether each URL the web crawler encounters will be visited and indexed (see the crawl rules sketch after this list).

  • Logging of each crawl

    Logs cover an entire crawl, which encompasses all domains in an engine.

  • User interfaces for managing domains, entry points, and crawl rules
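
To illustrate the nofollow support listed above, the following sketch checks a page for a robots meta tag set to "nofollow" and for links with rel="nofollow" attributes. It is a minimal example using Python with BeautifulSoup, not the web crawler's implementation; the HTML and the library choice are assumptions made for illustration.

    from bs4 import BeautifulSoup

    # Hypothetical page: no page-level nofollow, one link-level nofollow.
    html = """
    <html>
      <head><meta name="robots" content="index, follow"></head>
      <body>
        <a href="/private" rel="nofollow">Not followed</a>
        <a href="/public">Followed</a>
      </body>
    </html>
    """

    soup = BeautifulSoup(html, "html.parser")

    # Page-level directive: <meta name="robots" content="nofollow"> means
    # "do not follow any links on this page".
    meta = soup.find("meta", attrs={"name": "robots"})
    page_nofollow = meta is not None and "nofollow" in meta.get("content", "").lower()

    # Link-level directive: rel="nofollow" means "do not follow this link".
    for link in soup.find_all("a", href=True):
        skip = page_nofollow or "nofollow" in (link.get("rel") or [])
        print(link["href"], "-> skipped" if skip else "-> followed")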
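
The basic content extraction noted above produces one document per visited page. The example below shows the general shape of such a document; the field names are hypothetical placeholders chosen for this illustration, since the actual field set is predefined by the crawler and cannot be configured.

    # Hypothetical example of a document extracted from one crawled page.
    # The field names below are placeholders for illustration only; the web
    # crawler's real field set is predefined and not configurable.
    extracted_document = {
        "url": "https://example.com/blog/post-1",         # page address (assumed field name)
        "title": "Example blog post",                     # page title (assumed field name)
        "body_content": "Visible text of the page ...",   # page body text (assumed field name)
        "links": ["https://example.com/blog/post-2"],     # outgoing links (assumed field name)
    }

    # One document like this would be indexed for each page the crawl visits
    # and the crawl rules allow.
    print(extracted_document["title"])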
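
As a mental model for the crawl rules mentioned above, the sketch below assumes an ordered list of allow/deny rules matched against a URL's path, where the first matching rule decides whether the URL is visited. The rule format, match types, and trailing default rule are assumptions made for illustration, not the exact semantics of the beta feature.

    from urllib.parse import urlparse

    # Hypothetical crawl rules: (policy, match type, pattern), evaluated top to
    # bottom; the first matching rule decides. A trailing "allow everything"
    # default is assumed here.
    RULES = [
        ("deny",  "begins_with", "/admin"),
        ("allow", "begins_with", "/blog"),
        ("deny",  "contains",    "private"),
        ("allow", "begins_with", "/"),        # assumed default rule
    ]

    def should_visit(url: str) -> bool:
        path = urlparse(url).path or "/"
        for policy, match, pattern in RULES:
            matched = (
                path.startswith(pattern) if match == "begins_with"
                else pattern in path if match == "contains"
                else path.endswith(pattern)  # "ends_with"
            )
            if matched:
                return policy == "allow"
        return True  # unreachable with the default rule above

    for url in ("https://example.com/blog/post-1",
                "https://example.com/admin/login",
                "https://example.com/docs/private-notes"):
        print(url, "->", "visit" if should_visit(url) else "skip")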

What functionality is not supported?

  • Automatic or scheduled crawling

    Start crawls manually from the UI, or use the web crawler API to start a crawl on demand (see the sketch at the end of this list).

  • Single-page app (SPA) support

    The crawler cannot currently crawl pages that are pure JavaScript single-page apps.

  • Configurable content extraction

    Content extraction is currently limited to an unconfigurable, predefined set of fields.

  • Crawling private websites or websites behind authentication
  • Sitemap support

    The web crawler currently has no knowledge of sitemaps and cannot use them to discover pages to visit.

  • robots.txt support

    The web crawler does not currently adhere to robots.txt rules. The crawler only honors robots meta tags set to nofollow and links with rel="nofollow" attributes.

  • Crawl persistence

    If a crawl stops unexpectedly before it finishes, it cannot resume where it left off. You can restart the crawl from the beginning; the web crawler will not duplicate documents it has already indexed.

  • Extracting content from files

    Currently, the web crawler extracts content from HTML pages only.
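
The first item in this list notes that crawls can be started on demand over the web crawler API. The sketch below shows one way such a request could be issued using Python's requests library. The base URL, endpoint path, engine name, and API key are placeholders, not documented values; consult the web crawler API documentation for the actual request format.

    import requests

    # Placeholder values: the base URL, endpoint path, and credentials below are
    # assumptions made for illustration, not documented API values.
    BASE_URL = "http://localhost:3002"          # Enterprise Search deployment (assumed)
    ENGINE = "my-engine"                        # hypothetical engine name
    API_KEY = "private-xxxxxxxxxxxx"            # hypothetical private API key
    ENDPOINT = f"{BASE_URL}/api/as/v1/engines/{ENGINE}/crawler/crawl_requests"  # assumed path

    # Request a new crawl for the engine; the crawl covers all configured domains.
    response = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    response.raise_for_status()
    print(response.json())   # the newly created crawl request, as returned by the server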