Elasticsearch Query Language (ES|QL) is designed for fast, efficient querying of large datasets. Its straightforward, pipe-based syntax lets you write complex queries easily and reduces the learning curve. We're going to use ES|QL to run statistical analysis and compare different odds.
If you are reading this, you probably want to know how rich you can get before reaching the same odds as being hit by a bus. I can't blame you; I want to know too. Let's work out the odds so that we can make sure we win the lottery rather than get in an accident!
In this blog, we will figure out the probability of being hit by a bus and the probability of achieving wealth. We'll then compare the two to understand up to what point your chances of getting rich are higher, and when you should consider getting life insurance.
So how are we going to do that? This is going to be a mix of magic numbers pulled from different articles online, some synthetic data, and the power of ES|QL, the new Elasticsearch Query Language. Let's get started.
Data for the ES|QL analysis
The magic number
The challenge starts here, as this kind of dataset is hard to find. For the sake of the example, we are going to assume that ChatGPT is always right. Let's see what we get for the following question:

Cough cough… That sounds about right; this is going to be our magic number.
Generating the wealth data
Prerequisites
Before running any of the scripts below, make sure to install the following packages:
- elasticsearch==8.14.0
- matplotlib
- numpy
- pandas
- scipy
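You can install them all at once with pip:

```bash
pip install elasticsearch==8.14.0 matplotlib numpy pandas scipy
```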
Now, there is one more thing we need: a representative dataset with a wealth distribution to compute wealth probabilities. Real data exists in bits and pieces here and there, but for the example we are going to generate a 500K-line dataset with the Python script below. I am using Python 3.11.5 in this example:
import pandas as pd
import numpy as np
import getpass
from elasticsearch import Elasticsearch, helpers, exceptions

# Input the Elasticsearch host
hosts = input('Enter your Elasticsearch host address: ')

# Securely input the Elasticsearch API key
api_key = getpass.getpass(prompt='Enter your Elasticsearch API Key: ')

# Initialize the Elasticsearch client
client = Elasticsearch(
    hosts=hosts,
    api_key=api_key,
)

# Generate synthetic data with a highly skewed distribution
num_records = 500000
np.random.seed(42)  # Ensure reproducibility

# Generate ages and incomes
ages = np.random.randint(20, 80, num_records)  # Random ages between 20 and 80
incomes = np.random.exponential(scale=10000, size=num_records)  # Exponential distribution for income

# Use a more skewed distribution for net worth with a much larger range
net_worths = np.random.exponential(scale=100000000, size=num_records)  # Extremely skewed net worth

# Cap the net worths at $100 billion
net_worths = np.clip(net_worths, 0, 100000000000)

# Create the DataFrame
df = pd.DataFrame({
    'id': range(1, num_records + 1),
    'age': ages,
    'income': incomes,
    'net_worth': net_worths,
    'counter': range(1, num_records + 1)  # Add a counter field for pagination
})

# Index the data into Elasticsearch, recreating the index if it already exists
index_name = 'raw_wealth_data_large'
try:
    if client.indices.exists(index=index_name):
        client.indices.delete(index=index_name)
except exceptions.NotFoundError:
    pass
client.indices.create(index=index_name)

# Stream the DataFrame rows as bulk index actions
def generator(df):
    for index, row in df.iterrows():
        yield {
            "_index": index_name,
            "_source": row.to_dict()
        }

helpers.bulk(client, generator(df))

print("Data indexed successfully.")
It may take some time to run depending on your configuration, since we are ingesting 500K documents here!
FYI, after playing with a couple of versions of the script above and running the ES|QL queries against the synthetic data, it was obvious that the generated net worth across the population was not really representative of the real world. So I decided to use a log-normal distribution (np.random.lognormal) for income, to reflect a more realistic spread where most people have lower incomes and fewer people have very high incomes.
For the net worth calculation, I used a combination of random multipliers (np.random.uniform(0.5, 5)) and additive noise (np.random.normal(0, 10000)), and added a check to ensure no negative net worth values by using np.maximum(0, net_worths).
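For reference, here is a minimal sketch of what that revised generation step could look like; the log-normal mean and sigma below are illustrative assumptions, not the article's final parameters:

```python
import numpy as np

num_records = 500000
np.random.seed(42)  # Ensure reproducibility

# Log-normal income: most people earn little, a few earn a lot
# (mean and sigma here are assumptions for illustration)
incomes = np.random.lognormal(mean=10, sigma=1, size=num_records)

# Net worth = income * random multiplier + additive noise, clamped at zero
multipliers = np.random.uniform(0.5, 5, size=num_records)
noise = np.random.normal(0, 10000, size=num_records)
net_worths = np.maximum(0, incomes * multipliers + noise)
```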
Not only have we generated 500K documents, we have also used the Elasticsearch Python client to bulk ingest all of them into our deployment. Note that the value to pass as hosts in the code above is your deployment's Elasticsearch endpoint, which you will find alongside the Cloud ID in the Elastic Cloud console.
For the deployment API key, open Kibana and generate the key under Stack Management > API Keys:

The good news is that if you have a real dataset, all you need to do is change the above code to read your data and write documents with the same mapping, as sketched below.
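For example, here is a minimal sketch for indexing your own data, assuming a hypothetical wealth.csv with age, income, and net_worth columns, and reusing the client and generator from the script above:

```python
# Load your own dataset instead of generating one ('wealth.csv' is hypothetical)
df = pd.read_csv('wealth.csv')  # expects age, income and net_worth columns
df['id'] = range(1, len(df) + 1)
df['counter'] = df['id']  # keep the counter field for pagination later
helpers.bulk(client, generator(df))
```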
OK, we're getting there! The next step is poring over our wealth distribution.
ES|QL wealth analysis
Introducing ES|QL: A powerful tool for data analysis
The arrival of the Elasticsearch Query Language (ES|QL) is very exciting news for our users. It greatly simplifies querying, analyzing, and visualizing data stored in Elasticsearch, making it a powerful tool for all data-driven use cases.
ES|QL comes with a variety of functions and operators to perform aggregations, statistical analyses, and data transformations. We won't address them all in this blog post; however, our documentation is very detailed and will help you familiarize yourself with the language and its possibilities.
To get started with ES|QL today and run the blog post queries, simply start a trial on Elastic Cloud, load the data and run your first ES|QL query.
Understanding the wealth distribution with our first query
To get familiar with the dataset, head to Discover in Kibana and switch to ES|QL in the dropdown on the left hand side:

Let’s fire our first request:
FROM raw_wealth_data_large
| KEEP age, id, income, net_worth
| LIMIT 10

As you would expect from our indexing script earlier, we find the documents we bulk ingested. Notice the simplicity of pulling data from a given index with ES|QL: every query starts with the FROM clause, followed by your index.
In the query above, given we have 500K lines, we limited the number of returned documents to 10 by piping the output of the first segment of the query into the limit command. Pretty intuitive, right?
Alright, what would be more interesting is to understand the wealth distribution in our dataset. For this we will leverage one of the many functions ES|QL provides, namely percentile.
This will allow us to understand the relative position of each data point within the distribution of net worth. By calculating the median percentile (50th percentile), we can gauge where an individual’s net worth stands compared to others.
FROM raw_wealth_data_large
| STATS p50 = PERCENTILE(net_worth, 50)
Like our first query, we pipe the output of our index into another command, STATS, which combined with the PERCENTILE function outputs the median net worth:

The median is about $54K, which is unfortunately probably optimistic compared to the real world, but we are not going to solve that here. If we go a little further, we can look at the distribution in more granularity by computing more percentiles:
FROM raw_wealth_data_large
| STATS p25 = PERCENTILE(net_worth, 25)
    , p50 = PERCENTILE(net_worth, 50)
    , p75 = PERCENTILE(net_worth, 75)
    , p90 = PERCENTILE(net_worth, 90)
    , p95 = PERCENTILE(net_worth, 95)
    , p96 = PERCENTILE(net_worth, 96)
    , p97 = PERCENTILE(net_worth, 97)
    , p98 = PERCENTILE(net_worth, 98)
    , p99 = PERCENTILE(net_worth, 99)
| KEEP p25, p50, p75, p90, p95, p96, p97, p98, p99
With the below output:

The data reveals a significant disparity in wealth distribution, with the majority of wealth being concentrated among the richest individuals. Specifically, the top 5% (95th percentile) possess a disproportionately large portion of the total wealth, with a net worth starting at $852,988.26 and increasing dramatically in the higher percentiles.
The 99th percentile individuals hold a net worth exceeding $2 million, highlighting the skewed nature of wealth distribution. This indicates that a substantial portion of the population has modest net worth, which is probably what we want for this example.
Another way to look at this is to augment the previous query by grouping by age, to see whether there is (in our synthetic dataset) a relationship between wealth and age:
FROM raw_wealth_data_large
| STATS p25 = PERCENTILE(net_worth, 25)
    , p50 = PERCENTILE(net_worth, 50)
    , p75 = PERCENTILE(net_worth, 75)
    , p90 = PERCENTILE(net_worth, 90)
    , p95 = PERCENTILE(net_worth, 95)
    , p96 = PERCENTILE(net_worth, 96)
    , p97 = PERCENTILE(net_worth, 97)
    , p98 = PERCENTILE(net_worth, 98)
    , p99 = PERCENTILE(net_worth, 99) BY age
| KEEP p25, p50, p75, p90, p95, p96, p97, p98, p99, age
This could be visualized in a Kibana dashboard. Simply:
- Navigate to Dashboard
- Add a new ES|QL visualization
- Copy and paste our query
- Move the age field to the horizontal axis in the visualization configuration

Which will output:

The above suggests that the data generator randomized wealth uniformly across ages; there is no specific trend we can really see.
Median Absolute Deviation (MAD)
We calculate the median absolute deviation (MAD) to measure the variability of net worth in a robust manner, less influenced by outliers.
FROM raw_wealth_data_large
| STATS median_net_worth = MEDIAN(net_worth), mad_net_worth = MEDIAN_ABSOLUTE_DEVIATION(net_worth)
| KEEP median_net_worth, mad_net_worth

With a median net worth of $44,205.44, we can infer the typical range of net worth: most individuals' net worth falls between $9,581.78 and $97,992.66.
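For reference, MEDIAN_ABSOLUTE_DEVIATION computes the median of the absolute deviations from the median. Here is a minimal sketch of how you could sanity-check that kind of figure with numpy (the sample values are made up for illustration):

```python
import numpy as np

# MAD = median(|x - median(x)|), a robust measure of spread
values = np.array([10_000, 25_000, 44_000, 60_000, 2_000_000])  # toy sample
median = np.median(values)
mad = np.median(np.abs(values - median))
print(median, mad)
```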
The statistical showdown between Net Worth and Bus Collision
Alright, this is the moment to understand how rich we can get, based on our dataset, before getting hit by a bus. To do that, we are going to leverage ES|QL to pull our entire dataset in chunks and load it into a pandas DataFrame to build a net worth probability distribution. Finally, we will determine where the net worth and bus collision probabilities meet.
The entire Python notebook is available here. I also recommend you read this blog post which walks you through using ES|QL with pandas dataframes.
Helper functions
As you can see in the previously referenced blog post, support for ES|QL was added in version 8.12 of the Elasticsearch Python client. Our notebook thus first defines the functions below:
from io import StringIO
# Function to execute an ES|QL query and return the result as a DataFrame
def execute_esql_query(query):
    response = client.esql.query(query=query, format="csv")
    return pd.read_csv(StringIO(response.body))

# Function to fetch paginated data using the counter field
def fetch_paginated_data(index, num_records, size=10000):
    all_data = pd.DataFrame()
    for start in range(1, num_records + 1, size):
        end = start + size - 1
        query = f"""
        FROM {index}
        | WHERE counter >= {start} AND counter <= {end}
        | LIMIT {size}
        """
        data_chunk = execute_esql_query(query)
        all_data = pd.concat([all_data, data_chunk], ignore_index=True)
    return all_data
The first function is straightforward: it executes an ES|QL query and returns the result as a DataFrame. The second fetches the entire dataset from our index. Notice the trick here: I am using a counter field built into my index to paginate through the data. This is a workaround I am using while our engineering team works on pagination support for ES|QL.
Next, knowing that we have 500K documents in our index, we simply call these functions to load the data into a DataFrame:
# Fetch all data using pagination and ES|QL
num_records = 500000
all_data_df = fetch_paginated_data(index_name, num_records)
print(f"Total Data Retrieved: {len(all_data_df)} records")
Fit Pareto distribution
Next, we fit our data to a Pareto distribution, which is often used to model wealth distribution because it reflects the reality that a small percentage of the population controls most of the wealth. By fitting our data to this distribution, we can more accurately represent the probabilities of different net worth levels.
from scipy.stats import pareto
# Fit a Pareto distribution to the data
shape, loc, scale = pareto.fit(all_data_df['net_worth'], floc=0)

# Calculate the probability density for each net worth
all_data_df['net_worth_probability'] = pareto.pdf(all_data_df['net_worth'], shape, loc=loc, scale=scale)

# Normalize the probabilities so they sum to 1
all_data_df['net_worth_probability'] /= all_data_df['net_worth_probability'].sum()

print("Data with Net Worth Probability:")
print(all_data_df.head())
We can visualize the Pareto distribution with the code below:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import pareto

# Assuming all_data_df contains the net worth data fetched from Elasticsearch
# Fit a Pareto distribution to the data
shape, loc, scale = pareto.fit(all_data_df['net_worth'], floc=0)

# Plot the net worth probability distribution
plt.figure(figsize=(10, 6))

# Plot a histogram of the empirical net worth data
plt.hist(all_data_df['net_worth'], bins=100, density=True, alpha=0.6, color='g', label='Empirical Data')

# Plot the fitted Pareto distribution
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = pareto.pdf(x, shape, loc=loc, scale=scale)
plt.plot(x, p, 'k', linewidth=2, label='Fitted Pareto Distribution')

# Label and show the plot
plt.xlabel('Net Worth')
plt.ylabel('Probability')
plt.title('Net Worth Probability Distribution')
plt.legend()
plt.grid(True)
plt.show()

Breaking point
Finally, with the calculated probability, we determine the target net worth corresponding to the bus hit probability and visualize it. Remember, we use the magic number ChatGPT gave us for the probability of getting hit by a bus:
# Find the net worth corresponding to the bus hit probability
target_probability = 0.0000181
cumulative_probability = all_data_df['net_worth_probability'].cumsum()
target_net_worth_df = all_data_df[cumulative_probability >= target_probability].head(1)
target_net_worth = target_net_worth_df['net_worth'].iloc[0]
print(f"Net Worth with Probability >= {target_probability}: {target_net_worth}")

# Plot the net worth probability distribution with both thresholds
plt.figure(figsize=(10, 6))
plt.hist(all_data_df['net_worth'], bins=100, density=True, alpha=0.6, color='g', label='Empirical Data')
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = pareto.pdf(x, shape, loc=loc, scale=scale)
plt.plot(x, p, 'k', linewidth=2, label='Fitted Pareto Distribution')
plt.axhline(y=target_probability, color='r', linestyle='--', label='Bus Hit Probability')
plt.axvline(x=target_net_worth, color='g', linestyle='--', label=f'Net Worth = {target_net_worth:.2f}')
plt.xlabel('Net Worth')
plt.ylabel('Probability')
plt.title('Net Worth Probability Distribution')
plt.legend()
plt.grid(True)
plt.show()
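If you prefer to sanity-check that breaking point against the fitted distribution itself rather than the cumulative sum above, scipy's inverse survival function gives the net worth whose tail probability equals the bus-hit odds. This is an alternative sketch, not what the notebook does:

```python
# Alternative check: the net worth x where P(net_worth >= x) equals the bus-hit odds
from scipy.stats import pareto

target_probability = 0.0000181  # the ChatGPT magic number from above
threshold = pareto.isf(target_probability, shape, loc=loc, scale=scale)
print(f"Net worth with tail probability {target_probability}: {threshold:,.2f}")
```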
Conclusion
Based on our synthetic dataset, this chart vividly illustrates that the probability of amassing a net worth of approximately $12.5 million is as rare as the chance of being hit by a bus. For the fun of it, let’s ask ChatGPT what the probability is:


Okay… $439 million? I think ChatGPT might be hallucinating again.
Elasticsearch is packed with new features to help you build the best search solutions for your use case. Dive into our sample notebooks to learn more, start a free cloud trial, or try Elastic on your local machine now.