IMPORTANT: No additional bug fixes or documentation updates
will be released for this version. For the latest information, see the
current release documentation.
Web crawler schema
The web crawler indexes search documents using the following schema. All fields are strings or arrays of strings.
-
additional_urls
- The URLs of additional pages with the same content.
-
body_content
- The content of the page’s <body> tag with all HTML tags removed. Truncated to crawler.extraction.body_size.limit.
-
domains
- The domains in which this content appears.
-
headings
- The text of the page’s HTML headings (h1-h6 elements). Limited by crawler.extraction.headings_count.limit.
-
id
- The unique identifier for the page.
-
last_crawled_at
- The date and time when the page was last crawled.
-
links
- Links found on the page. Limited by crawler.extraction.indexed_links_count.limit.
-
meta_description
- The page’s description, taken from the <meta name="description"> tag. Truncated to crawler.extraction.description_size.limit.
-
meta_keywords
- The page’s keywords, taken from the <meta name="keywords"> tag. Truncated to crawler.extraction.keywords_size.limit.
-
title
- The title of the page, taken from the <title> tag. Truncated to crawler.extraction.title_size.limit.
-
url
- The URL of the page.
-
url_host
- The hostname or IP address from the page’s URL.
-
url_path
- The full pathname from the page’s URL.
-
url_path_dir1
- The first segment of the pathname from the page’s URL.
-
url_path_dir2
- The second segment of the pathname from the page’s URL.
-
url_path_dir3
- The third segment of the pathname from the page’s URL.
-
url_port
- The port number from the page’s URL (as a string).
-
url_scheme
- The scheme of the page’s URL.
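To make the URL-derived fields concrete, the following Python sketch shows one way a URL could be split into url_scheme, url_host, url_port, url_path, and url_path_dir1 through url_path_dir3. It is an illustration only, not the crawler's implementation: the function name url_fields is hypothetical, and the handling of default ports and of file names as path segments is an assumption.

```python
from urllib.parse import urlsplit


def url_fields(url: str) -> dict[str, str]:
    """Split a URL into the url_* fields described in the schema (illustrative only)."""
    parts = urlsplit(url)

    # Assumption: when the URL has no explicit port, fall back to the
    # scheme's well-known port. The crawler's actual behavior may differ.
    default_ports = {"http": 80, "https": 443}
    port = parts.port or default_ports.get(parts.scheme)

    fields = {
        "url": url,
        "url_scheme": parts.scheme,
        "url_host": parts.hostname or "",
        "url_path": parts.path or "/",
    }
    if port is not None:
        # The schema stores the port as a string.
        fields["url_port"] = str(port)

    # Assumption: every non-empty path segment counts, including a file name.
    segments = [segment for segment in parts.path.split("/") if segment]
    for index, segment in enumerate(segments[:3], start=1):
        fields[f"url_path_dir{index}"] = segment
    return fields


if __name__ == "__main__":
    for name, value in url_fields("https://example.com/blog/2021/post.html").items():
        print(f"{name}: {value}")
```

Under these assumptions, a URL with fewer than three path segments simply omits the remaining url_path_dirN fields.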
In addition to these predefined fields, you can also extract custom fields via meta tags and attributes.
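Several predefined fields are described as "Truncated to" or "Limited by" a crawler.extraction.* setting. The Python sketch below illustrates the difference between the two kinds of limit: a size limit shortens a string, while a count limit shortens an array. It is a sketch under stated assumptions, not the crawler's code: the helper names are hypothetical, size limits are assumed to be measured in UTF-8 bytes, and the limit values in the usage example are made up.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character cut off at the boundary.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def apply_limits(doc: dict, size_limits: dict, count_limits: dict) -> dict:
    """Return a copy of doc with string fields truncated and array fields capped."""
    limited = dict(doc)
    for field, max_bytes in size_limits.items():
        if field in limited:
            limited[field] = truncate_utf8(limited[field], max_bytes)
    for field, max_items in count_limits.items():
        if field in limited:
            limited[field] = limited[field][:max_items]
    return limited


if __name__ == "__main__":
    # Hypothetical limit values chosen only to make the effect visible.
    doc = {"title": "Hello world", "headings": ["Intro", "Setup", "Usage"]}
    print(apply_limits(doc, {"title": 5}, {"headings": 2}))
```

In this sketch, title would correspond to a size setting such as crawler.extraction.title_size.limit, and headings to a count setting such as crawler.extraction.headings_count.limit.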