WARNING: Version 5.3 of the Elastic Stack has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
X-Pack Graph Troubleshooting
Why are results missing?
The default settings in Graph API requests are tuned to filter out noisy results by using the following strategies:
- Only looking at samples of the most-relevant documents for a query
- Only considering terms that have a significant statistical correlation with the sample
- Only considering terms to be paired that have at least 3 documents asserting that connection
These are useful defaults for getting "big picture" signals from noisy data, but for detailed forensic-type work they can miss details from individual documents. To ensure a graph exploration produces all the data, consider changing the following settings:
- Increase `sample_size` to a larger number of documents to analyse more data on each shard.
- Set the `use_significance` setting to `false` to retrieve terms regardless of any statistical correlation with the sample.
- Set the `min_doc_count` for your vertices to 1 so that only one document is required to assert a relationship.
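As a sketch, the three changes above can be combined in a single explore request body. The index and field names (`emails`, `sender`, `recipient`) and the seed query are illustrative placeholders, and the `controls`, `vertices`, and `connections` keys follow the 5.x Graph API request layout:

```python
import json

# Illustrative Graph explore request body that favours completeness over
# noise-filtering. Index, field names, and values are placeholders.
explore_request = {
    "query": {"match": {"body": "budget"}},  # seed query (placeholder)
    "controls": {
        "sample_size": 5000,        # analyse more documents on each shard
        "use_significance": False,  # keep terms regardless of statistical correlation
    },
    "vertices": [
        # A single document is enough to assert a relationship.
        {"field": "sender", "min_doc_count": 1}
    ],
    "connections": {
        "vertices": [
            {"field": "recipient", "min_doc_count": 1}
        ]
    },
}

print(json.dumps(explore_request, indent=2))
```

Expect slower, noisier responses with these settings; they trade the default quality-filtering for exhaustiveness.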
What can I do to improve performance?
With the default setting of `use_significance` set to `true`, the Graph API performs a background frequency check on each term it discovers during exploration. Every unique term has its frequency looked up in the index, which costs at least one disk seek. Disk seeks are expensive, so if the noise-filtering aspects of the Graph API are not required, setting `use_significance` to `false` eliminates all of these expensive checks (but also any quality-filtering of terms).
If the significance noise-filtering features are required, there are three ways to reduce the number of checks performed:
- Consider fewer documents by decreasing `sample_size`. Considering fewer documents can actually improve results if the quality of matches is quite variable.
- Avoid noisy documents that have very many terms. This can be achieved either by allowing ranking to naturally favour shorter documents in the top-results sample (see enabling norms) or by explicitly excluding large documents using criteria in the seed and guiding queries passed to the Graph API.
- Increase the frequency threshold. Many terms occur very infrequently, so even increasing the threshold by one can massively reduce the number of candidate terms whose background frequencies are checked.
The downside of all of these tweaks is that they reduce the scope of the information analysed and can increase the chance of missing interesting details. However, the information lost tends to come from lower-quality documents with lower-frequency terms, so this can be an acceptable trade-off.