Web crawler (beta) events logs reference
The Enterprise Search web crawler logs many events while discovering, extracting, and indexing web content.
Enterprise Search records these events using Elastic Common Schema (ECS), including a custom field set called crawler.* for crawler-specific data (like crawl_id).
To view these events, see View web crawler events logs.
This document provides a reference to these events and their fields.
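First, the reference describes the fields common to all web crawler events, including:
- Crawler-specific fields
- Base fields
- Service fields
- Process fields
- Host fields
Then, the remainder of the document describes different types of web crawler events:
- Crawl lifecycle events
- URL lifecycle events
- Content ingestion events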
Fields common to all web crawler events
All web crawler events include the following common fields.
Crawler-specific fields
-
crawler.crawl.id
- A unique ID of a specific crawl.
Base fields
-
@timestamp
- A UTC timestamp of the event.
-
event.id
- A unique identifier of the event.
-
event.action
- The type of event. See the sections that follow.
-
message
- A textual description of the event (useful for displaying in a UI for human consumption).
Service fields
-
service.ephemeral_id
- A unique identifier of the crawler process generating the event (changes every time the process is restarted).
-
service.type
- All events have this set to crawler.
-
service.version
- Current version of the Enterprise Search product.
Process fields
-
process.pid
- The PID of the crawler instance.
-
process.thread.id
- The id of the thread logging the event.
Host fields
-
host.name
- The host name where the crawler instance is deployed.
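As a rough illustration of how the common fields might be consumed, the following Python sketch checks one log line for the fields listed above. It assumes each event is stored as a single JSON object and that fields may appear either as dotted keys or as nested objects; this reference states neither detail. The helper names and all sample values are hypothetical.

```python
import json

# Common fields named in this reference, written as dotted ECS field names.
COMMON_FIELDS = [
    "crawler.crawl.id",
    "@timestamp", "event.id", "event.action", "message",
    "service.ephemeral_id", "service.type", "service.version",
    "process.pid", "process.thread.id",
    "host.name",
]


def get_field(event, dotted_name):
    """Return a field from a flat (dotted keys) or nested event dict, else None."""
    if dotted_name in event:
        return event[dotted_name]
    current = event
    for part in dotted_name.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def missing_common_fields(event):
    """List the common fields that are absent from an event."""
    return [name for name in COMMON_FIELDS if get_field(event, name) is None]


# Hypothetical log line; every value here is invented for illustration.
sample_line = json.dumps({
    "@timestamp": "2021-01-01T00:00:00Z",
    "event": {"id": "evt-1", "action": "crawl-start"},
    "message": "Crawl started",
    "crawler": {"crawl": {"id": "crawl-1"}},
    "service": {"ephemeral_id": "svc-1", "type": "crawler", "version": "0.0.0"},
    "process": {"pid": 1234, "thread": {"id": 1}},
    "host": {"name": "example-host"},
})

sample_event = json.loads(sample_line)
print(missing_common_fields(sample_event))  # prints []
```

A check like this can be useful before building dashboards or alerts on top of the logs, since it surfaces events that do not match the expected shape.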
Crawl lifecycle events
Each crawl lifecycle event records an important checkpoint within the lifecycle of a specific crawl, for example: start, seed, end.
Most of the event information is captured in the message field, along with the other common fields described above.
The fields below provide additional details.
Each crawl lifecycle event has one of the following values for event.action:
-
crawl-start
- Emitted when a crawl is started. Includes crawl configuration.
-
crawl-seed
- Emitted every time a crawl is seeded with a set of URLs from the outside. Includes the list of URLs submitted to the crawler.
-
crawl-end
- Emitted when a crawl is ended for any reason (finished, canceled, etc).
-
crawl-status
- Periodic events with a snapshot of crawler status metrics used for monitoring an active crawl over time.
Crawl start events
-
event.kind
- Set to event.
-
event.type
- Set to start.
-
event.action
- Set to crawl-start.
-
crawler.crawl.config
- A serialized version of the crawl config.
Crawl seed events
-
event.kind
- Set to event.
-
event.type
- Set to change.
-
event.action
- Set to crawl-seed.
-
crawler.crawl.seed_urls
- A list of URLs used to seed a crawl.
-
crawler.url.type
- The type of the URLs being added:
  - content for generic content URLs.
  - sitemap for sitemap and sitemap-index URLs.
  - feed for RSS/ATOM feeds.
Crawl end events
-
event.kind
- Set to event.
-
event.type
- Set to end.
-
event.action
- Set to crawl-end.
-
event.outcome
- Set to success or failure depending on how a crawl ended (canceled crawls will be considered failed, etc).
Crawl status events
-
event.kind
- Set to metric.
-
event.type
- Set to info.
-
event.action
- Set to crawl-status.
-
crawler.status.*
- A set of metrics describing the global state of a crawl and crawl-specific stats that may be useful to understand the state of a crawl over time.
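To show how the four crawl lifecycle actions fit together, here is a minimal Python sketch that summarizes each crawl from a list of already-parsed events. It assumes events are dicts keyed by dotted field names (for example "event.action"), which is an assumption about how the logs were loaded and not something this reference specifies. The function name and the summary keys are hypothetical.

```python
from collections import defaultdict


def summarize_crawls(events):
    """Group crawl lifecycle events by crawler.crawl.id and summarize each crawl."""
    summary = defaultdict(
        lambda: {"started": False, "seed_events": 0, "status_events": 0, "outcome": None}
    )
    for event in events:
        crawl_id = event.get("crawler.crawl.id")
        if crawl_id is None:
            continue  # skip anything that is not tied to a crawl
        crawl = summary[crawl_id]
        action = event.get("event.action")
        if action == "crawl-start":
            crawl["started"] = True
        elif action == "crawl-seed":
            crawl["seed_events"] += 1
        elif action == "crawl-status":
            crawl["status_events"] += 1
        elif action == "crawl-end":
            # event.outcome is success or failure on crawl-end events.
            crawl["outcome"] = event.get("event.outcome")
    return dict(summary)


# Hypothetical events; IDs are invented for illustration.
sample_events = [
    {"crawler.crawl.id": "a", "event.action": "crawl-start"},
    {"crawler.crawl.id": "a", "event.action": "crawl-seed"},
    {"crawler.crawl.id": "a", "event.action": "crawl-status"},
    {"crawler.crawl.id": "a", "event.action": "crawl-status"},
    {"crawler.crawl.id": "a", "event.action": "crawl-end", "event.outcome": "success"},
    {"crawler.crawl.id": "b", "event.action": "crawl-start"},
]

for crawl_id, info in summarize_crawls(sample_events).items():
    print(crawl_id, info)
```

A crawl whose outcome is still None after a start event has not logged a crawl-end event yet, which may mean it is still running.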
URL lifecycle events
Each URL lifecycle event is scoped to a particular URL within a specific crawl. Each event describes what happened to the URL during the crawl, for example how and when the crawler discovered it, or why the crawler skipped it. These events have enough detail to allow a human operator to understand exactly how the system discovered a specific URL, what decisions were made about it, and what the result of processing the URL was.
Each URL lifecycle event has one of the following values for event.action:
-
url-seed
- URL submitted to the crawl backlog for processing (from a seed list, from within the crawl, via an API, etc).
-
url-fetch
- URL fetch attempt including timing information, server response headers, HTTP code, etc.
-
url-discover
- URL discovery events. Each time the crawler discovers a URL on a page and makes a decision about it, the URL and the decision are logged.
-
url-extracted
- Events logged when we finish content extraction from a URL (maybe with some basic metadata extracted from the page).
-
url-output
- An event marking the end of URL processing.
Fields common to all URL lifecycle events
All URL lifecycle events include the following common fields:
Identification fields:
-
crawler.url.hash
- A unique identifier for the URL as it is handled by the crawler. All events for the same URL within a single crawl share the same hash.
-
crawler.url.source_hash
- A unique identifier of the URL that was used to discover this URL (only used for cases when a URL was discovered during a crawl and not submitted as a seed URL).
URL details:
-
url.full
- The full URL string.
-
url.scheme
- Scheme portion of the URL.
-
url.domain
- Domain portion of the URL.
-
url.port
- Port of the URL.
-
url.path
- Path of the URL.
-
url.query
- URL query string. Included when available.
-
url.fragment
- URL fragment. Included when available.
-
url.username
- Username portion of the URL. Included when available.
-
url.password
- Password portion of the URL. Included when available.
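Because all events for the same URL within a single crawl share the same crawler.url.hash, that field can be used to reconstruct what happened to one URL. The sketch below is one possible way to do this in Python, assuming events are dicts keyed by dotted field names and that @timestamp values are ISO 8601 UTC strings in a single consistent format. The function name and sample data are hypothetical.

```python
def url_timeline(events, url_hash):
    """Return (timestamp, action) pairs for one URL, oldest first."""
    matching = [e for e in events if e.get("crawler.url.hash") == url_hash]
    # ISO 8601 UTC timestamps in one consistent format sort correctly as strings.
    matching.sort(key=lambda e: e["@timestamp"])
    return [(e["@timestamp"], e["event.action"]) for e in matching]


# Hypothetical events; hashes and timestamps are invented for illustration.
events = [
    {"@timestamp": "2021-01-01T00:00:03Z", "event.action": "url-output", "crawler.url.hash": "abc"},
    {"@timestamp": "2021-01-01T00:00:01Z", "event.action": "url-seed", "crawler.url.hash": "abc"},
    {"@timestamp": "2021-01-01T00:00:02Z", "event.action": "url-fetch", "crawler.url.hash": "abc"},
    {"@timestamp": "2021-01-01T00:00:02Z", "event.action": "url-seed", "crawler.url.hash": "def"},
]

for timestamp, action in url_timeline(events, "abc"):
    print(timestamp, action)
```

When comparing URLs across crawls, the crawl ID would also need to be part of the filter, since the hash is only described as shared within a single crawl.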
URL seed events
These are small events used to track the flow of URLs into the crawler system and are primarily focused on tracking how a specific URL got into the backlog.
-
event.kind
- Set to event.
-
event.type
- Set to start.
-
event.action
- Set to url-seed.
-
crawler.url.type
- The type of the URL being added:
  - content for generic content URLs.
  - sitemap for sitemap and sitemap-index URLs.
  - feed for RSS/ATOM feeds.
-
crawler.url.source_type
- The name of the source used for seeding the crawl:
  - seed-list for seed-list URLs submitted as a part of the crawl configuration.
  - organic for URLs discovered during a crawl by following organic links.
  - redirect for pages discovered by following a redirect.
  - canonical-url for pages discovered via the canonical URL meta tag.
-
crawler.url.source_url.id
- Set to the id of the URL the crawler used to discover this page (only for URLs discovered during a crawl).
-
crawler.url.crawl_depth
- A positive number indicating the number of steps the crawler had to take from the seed URL set to reach this specific page.
URL fetch events
These are the primary events used for troubleshooting network-layer issues with a crawl. They aim to provide enough insight into what happened during a fetch attempt and what the results were.
Each event represents a single HTTP request. If the crawler followed redirects, it logs a separate event for each request, including information about the redirect response, to help troubleshoot redirect chains.
-
event.kind
- Set to event.
-
event.type
- Set to access.
-
event.action
- Set to url-fetch.
Event timing and outcome details:
-
event.start
- The start of the HTTP request.
-
event.end
- The end of the HTTP request.
-
event.duration
- Response timing for the HTTP request (total time it took to get the full response).
-
event.outcome
- An ECS categorization field. Denotes whether the event represents a success or a failure from the perspective of the crawler:
  - failure for all 3xx, 4xx and 5xx responses.
  - success for all 2xx responses.
  - unknown for network timeouts.
HTTP request details:
-
http.request.method
- The method of the request.
HTTP response details:
-
http.response.body.bytes
- The size of the response body in bytes (for successful responses only).
-
http.response.status_code
- The HTTP response status code, as a string.
HTTP redirect details:
-
crawler.url.redirect.location
- The Location header content for redirect responses.
-
crawler.url.redirect.count
- Number of redirects followed so far in a redirect chain (starts with 1 on the first redirect and is increased on each subsequent redirect until a non-redirect response is received or the maximum number of redirects is reached).
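The mapping between HTTP status codes and event.outcome described above can be written down as a small function. This is only a sketch of that documented mapping, not the crawler's own code. It assumes events are dicts keyed by dotted field names and that a timed-out request has no status code; codes outside the 2xx to 5xx ranges are not covered by this reference, so treating them as unknown is also an assumption. Both function names are hypothetical.

```python
def expected_fetch_outcome(status_code):
    """Map an HTTP status code to the event.outcome value described above."""
    if status_code is None:
        # Assumption: a timed-out request carries no status code.
        return "unknown"
    code = int(status_code)  # the reference describes the code as a string
    if 200 <= code <= 299:
        return "success"
    if 300 <= code <= 599:
        return "failure"
    # Assumption: anything else (for example 1xx) is treated as unknown.
    return "unknown"


def outcome_is_consistent(event):
    """Check a url-fetch event's outcome against its status code."""
    expected = expected_fetch_outcome(event.get("http.response.status_code"))
    return event.get("event.outcome") == expected


# Hypothetical url-fetch event for a redirect response.
redirect_event = {
    "event.action": "url-fetch",
    "event.outcome": "failure",
    "http.response.status_code": "301",
}
print(outcome_is_consistent(redirect_event))  # prints True
```

Note that a redirect counts as a failure for the individual request even when the redirect chain eventually ends in a 2xx response, because each request in the chain is logged as its own event.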
URL discover events
These are small events used to troubleshoot URL discovery within the crawler. Each time the crawler sees a new URL (extracted from a page or from following a redirect), it logs information about it along with the decision on what will happen to the newly discovered link.
-
event.kind
- Set to event.
-
event.action
- Set to url-discover.
-
event.type
- Depending on the decision regarding the URL, set to one of:
  - allowed if the URL will be added to the backlog for future crawling.
  - denied if the URL will not be followed (the message field will have a human-readable explanation of why the crawler decided not to follow it).
-
crawler.url.source_type
- The type of the source used for discovering the link:
  - organic for URLs discovered during a crawl by following organic links.
  - redirect for pages discovered by following a redirect.
-
crawler.url.source_url.id
- Set to the id of the URL the crawler used to discover this page.
-
crawler.url.crawl_depth
- A positive number indicating the number of steps the crawler had to take from the seed URL set to reach this specific page.
-
crawler.url.deny_reason
- A code explaining the reason for skipping a URL during a crawl:
  - link_too_deep when we hit a crawl depth limit.
  - link_too_long when we hit a URL length limit.
  - link_with_too_many_params when we hit a limit on the number of URL parameters allowed.
  - link_with_too_many_segments when we hit a limit on the number of URL segments allowed.
  - queue_full when we hit a backlog size limit.
  - sitemap_denied when a URL is prohibited from crawling by a sitemap rule.
  - domain_filter_denied for prohibited cross-domain links.
  - page_already_visited for crawl-scoped URL de-duplication events.
  - incorrect_protocol for non-HTTP links and non-HTTPS links in HTTPS-enforced mode.
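A quick way to see why a crawl skipped URLs is to tally the deny reasons across its url-discover events. The Python sketch below illustrates the idea under the assumption that events are dicts keyed by dotted field names; the function name and sample events are hypothetical.

```python
from collections import Counter


def deny_reason_counts(events):
    """Count crawler.url.deny_reason values across denied url-discover events."""
    return Counter(
        event.get("crawler.url.deny_reason")
        for event in events
        if event.get("event.action") == "url-discover"
        and event.get("event.type") == "denied"
    )


# Hypothetical events, invented for illustration.
discover_events = [
    {"event.action": "url-discover", "event.type": "denied", "crawler.url.deny_reason": "link_too_deep"},
    {"event.action": "url-discover", "event.type": "denied", "crawler.url.deny_reason": "page_already_visited"},
    {"event.action": "url-discover", "event.type": "denied", "crawler.url.deny_reason": "page_already_visited"},
    {"event.action": "url-discover", "event.type": "allowed"},
    {"event.action": "url-fetch", "event.type": "access"},
]

for reason, count in deny_reason_counts(discover_events).most_common():
    print(reason, count)
```

A high count for a limit-related code such as link_too_deep or queue_full may suggest that a crawl limit is worth reviewing, while page_already_visited is expected on most sites.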
URL extracted events
These events are focused on the extraction portion of the crawler process and are logged to help an operator troubleshoot the process of content extraction for the pages on their domains. The primary focus here is capturing the details of the extraction process.
Each event represents a single extractor handling a single piece of content.
-
event.kind
- Set to event.
-
event.action
- Set to url-extracted.
-
event.module
- The name of the extractor generating the event (e.g. html).
-
event.type
- Depending on the decision regarding the URL, set to one of:
  - allowed if the URL has been allowed to be indexed.
  - denied if the URL has not been indexed because of a crawl rule, a robots.txt rule, etc (the message field will have a human-readable explanation of what happened).
Event timing and outcome details:
-
event.start
- The start of the extraction process.
-
event.end
- The end of the extraction process.
-
event.duration
- End-to-end timing for the extraction process (total time it took to get the data extracted).
-
event.outcome
- An ECS categorization field. Denotes whether the event represents a success or a failure from the perspective of the crawler:
  - failure if extraction process failed and we are going to drop the content.
  - success if extraction process succeeded (or failed in a graceful manner).
Extraction result details:
-
crawler.extraction.content_type
- Content type for the page.
-
crawler.extraction.content_size.bytes
- The size of the page.
-
crawler.extraction.fields_extracted
- The list of fields extracted.
URL output events
These events are designed to capture the results of ingestion of a single piece of content into an external system (file, App Search, etc). The main goal here is to capture any data needed to tie a URL fetched and processed by the crawler to the changes performed in the external system as a result of the crawl.
Each event represents a single output module handling a single piece of content.
-
event.kind
- Set to event.
-
event.type
- Set to end.
-
event.action
- Set to url-output.
-
event.module
- The name of the output module generating the event (e.g. file, app-search).
Event timing and outcome details:
-
event.start
- The start of the output ingestion process.
-
event.end
- The end of the output ingestion process.
-
event.duration
- End-to-end timing for the output ingestion process (total time it took to get the data processed by the module).
-
event.outcome
- An ECS categorization field. Denotes whether the event represents a success or a failure from the perspective of the crawler:
  - failure if output ingestion process failed and we are going to drop the content.
  - success if output ingestion process succeeded (or failed in a graceful manner).
  - unknown for cases specific to an output module.
Output ingestion results (file module):
-
crawler.output.file.directory
- The directory where the event has been logged.
-
crawler.output.file.name
- The name of the file where the event has been logged (base name without the directory).
Output ingestion results (app-search module):
-
crawler.output.app-search.engine.id
- The id of the engine used to ingest the content.
-
crawler.output.app-search.engine.name
- The name of the engine used to ingest the content.
-
crawler.output.app-search.document_id
- The id of the document within the engine.
-
crawler.output.app-search.content_hash
- The content hash used for de-duplication purposes.
Content ingestion events
A special kind of event used to troubleshoot the ingestion process. These events are used only by complex output modules and, potentially, only enabled in debug mode or by using a special crawl config option. The goal of these events is to explain the ingestion process results in more detail than could be captured by a URL output event.
-
event.kind
- Set to event.
-
event.type
- Set to info.
-
event.action
- Set to ingest-progress.
-
event.module
- The name of the output module generating the event (e.g. file, app-search).
-
message
- Details on what is happening with the extraction process.
App Search logs URL-scoped events that explain how a specific piece of content from the crawler got ingested into the external system. These are important for troubleshooting cases when the crawler discovers and crawls a URL, but due to App Search de-duplication logic the content does not get ingested, etc.
-
ingest-progress
- An event logged by an output module to help an operator troubleshoot the ingestion process. These are pretty generic events using the message field to explain what is happening.
URL identification fields:
These are used to correlate an ingestion event to the rest of the events generated by the crawler for a specific page:
-
crawler.url.hash
- A unique identifier for the URL as it is handled by the crawler. All events for the same URL within a single crawl share the same hash, since it is calculated as the SHA1 hash of the URL itself.
-
url.full
- The full URL string.
-
url.scheme
- Scheme portion of the URL.
-
url.domain
- Domain portion of the URL.
-
url.port
- Port of the URL.
-
url.path
- Path of the URL.
-
url.query
- URL query string. Included when available.
-
url.fragment
- URL fragment. Included when available.
-
url.username
- Username portion of the URL. Included when available.
-
url.password
- Password portion of the URL. Included when available.