Define Custom Workloads: Tracks#
Definition#
A track describes one or more benchmarking scenarios. Its structure is described in detail in the track reference.
Creating a track from data in an existing cluster#
If you already have a cluster with data in it you can use the create-track
subcommand of Rally to create a basic Rally track. To create a Rally track with data from the indices products
and companies
that are hosted by a locally running Elasticsearch cluster, issue the following command:
esrally create-track --track=acme --target-hosts=127.0.0.1:9200 --indices="products,companies" --output-path=~/tracks
Alternatively, to create a Rally track with data from the data streams logs-*
and metrics-*
, issue the following command:
esrally create-track --track=acme --target-hosts=127.0.0.1:9200 --data-streams="metrics-*,logs-*" --output-path=~/tracks
If you want to connect to a cluster with TLS and basic authentication enabled, for example via Elastic Cloud, also specify –client-options and change basic_auth_user
and basic_auth_password
accordingly:
esrally create-track --track=acme --target-hosts=abcdef123.us-central-1.gcp.cloud.es.io:9243 --client-options="use_ssl:true,verify_certs:true,basic_auth_user:'elastic',basic_auth_password:'secret-password'" --indices="products,companies" --output-path=~/tracks
The track generator will create a folder with the track’s name in the specified output directory:
> find tracks/acme
tracks/acme
tracks/acme/challenges
tracks/acme/challenges/default.json
tracks/acme/companies-documents.json
tracks/acme/companies-documents.json.bz2
tracks/acme/companies-documents-1k.json
tracks/acme/companies-documents-1k.json.bz2
tracks/acme/companies.json
tracks/acme/operations
tracks/acme/operations/default.json
tracks/acme/products-documents.json
tracks/acme/products-documents.json.bz2
tracks/acme/products-documents-1k.json
tracks/acme/products-documents-1k.json.bz2
tracks/acme/products.json
tracks/acme/track.json
The files are organized as follows:
track.json
contains the actual Rally track. For details see the track reference.companies.json
andproducts.json
contain the mapping and settings for the extracted indices.*-documents.json(.bz2)
contains the sources of all the documents from the extracted indices. The files suffixed with-1k
contain a smaller version of the document corpus to support test mode.operations/default.json
contains a sample operation that can be used when adding search requests to the track.challenges/default.json
contains the initial actions for the default challenge - delete index, create, and bulk-index.
Creating a track from scratch#
We will create the track “tutorial” step by step. We store everything in the directory ~/rally-tracks/tutorial
but you can choose any other location.
First, get some data. Geonames provides geo data under a creative commons license. Download allCountries.zip (around 300MB), extract it and inspect allCountries.txt
.
The file is tab-delimited but to bulk-index data with Elasticsearch we need JSON. Convert the data with the following script:
import json
cols = (("geonameid", "int", True),
("name", "string", True),
("asciiname", "string", False),
("alternatenames", "string", False),
("latitude", "double", True),
("longitude", "double", True),
("feature_class", "string", False),
("feature_code", "string", False),
("country_code", "string", True),
("cc2", "string", False),
("admin1_code", "string", False),
("admin2_code", "string", False),
("admin3_code", "string", False),
("admin4_code", "string", False),
("population", "long", True),
("elevation", "int", False),
("dem", "string", False),
("timezone", "string", False))
def main():
with open("allCountries.txt", "rt", encoding="UTF-8") as f:
for line in f:
tup = line.strip().split("\t")
record = {}
for i in range(len(cols)):
name, type, include = cols[i]
if tup[i] != "" and include:
if type in ("int", "long"):
record[name] = int(tup[i])
elif type == "double":
record[name] = float(tup[i])
elif type == "string":
record[name] = tup[i]
print(json.dumps(record, ensure_ascii=False))
if __name__ == "__main__":
main()
Store the script as toJSON.py
in the tutorial directory (~/rally-tracks/tutorial
). Invoke it with python3 toJSON.py > documents.json
.
Then store the following mapping file as index.json
in the tutorial directory:
{
"settings": {
"index.number_of_replicas": 0
},
"mappings": {
"dynamic": "strict",
"properties": {
"geonameid": {
"type": "long"
},
"name": {
"type": "text"
},
"latitude": {
"type": "double"
},
"longitude": {
"type": "double"
},
"country_code": {
"type": "text"
},
"population": {
"type": "long"
}
}
}
}
Note
This tutorial assumes that you want to benchmark a version of Elasticsearch 7.0.0 or later. If you want to benchmark Elasticsearch prior to 7.0.0 you need to add the mapping type above so index.json
will look like:
...
"mappings": {
"docs": {
...
}
}
...
For details on the allowed syntax, see the Elasticsearch documentation on mappings and the create index API.
Finally, store the track as track.json
in the tutorial directory:
{
"version": 2,
"description": "Tutorial benchmark for Rally",
"indices": [
{
"name": "geonames",
"body": "index.json"
}
],
"corpora": [
{
"name": "rally-tutorial",
"documents": [
{
"source-file": "documents.json",
"document-count": 11658903,
"uncompressed-bytes": 1544799789
}
]
}
],
"schedule": [
{
"operation": {
"operation-type": "delete-index"
}
},
{
"operation": {
"operation-type": "create-index"
}
},
{
"operation": {
"operation-type": "cluster-health",
"request-params": {
"wait_for_status": "green"
},
"retry-until-success": true
}
},
{
"operation": {
"operation-type": "bulk",
"bulk-size": 5000
},
"warmup-time-period": 120,
"clients": 8
},
{
"operation": {
"operation-type": "force-merge"
}
},
{
"operation": {
"name": "query-match-all",
"operation-type": "search",
"body": {
"query": {
"match_all": {}
}
}
},
"clients": 8,
"warmup-iterations": 1000,
"iterations": 1000,
"target-throughput": 100
}
]
}
The numbers under the documents
property are needed to verify integrity and provide progress reports. Determine the correct document count with wc -l documents.json
. For the size in bytes, use stat -f %z documents.json
on macOS and stat -c %s documents.json
on GNU/Linux.
Note
This tutorial assumes that you want to benchmark a version of Elasticsearch 7.0.0 or later. If you want to benchmark Elasticsearch prior to 7.0.0 you need to add the types
property above so track.json
will look like:
...
"indices": [
{
"name": "geonames",
"body": "index.json",
"types": [ "docs" ]
}
],
...
Note
You can store any supporting scripts along with your track. However, you need to place them in a directory starting with “_”, e.g. “_support”. Rally loads track plugins (see below) from any directory but will ignore directories starting with “_”.
Note
We have defined a JSON schema for tracks which you can use to check how to define your track. You should also check the tracks provided by Rally for inspiration.
The new track appears when you run esrally list tracks --track-path=~/rally-tracks/tutorial
:
$ esrally list tracks --track-path=~/rally-tracks/tutorial
____ ____
/ __ \____ _/ / /_ __
/ /_/ / __ `/ / / / / /
/ _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
/____/
Available tracks:
Name Description Documents Compressed Size Uncompressed Size
---------- ----------------------------- ----------- --------------- -----------------
tutorial Tutorial benchmark for Rally 11658903 N/A 1.4 GB
You can also show details about your track with esrally info --track-path=~/rally-tracks/tutorial
:
$ esrally info --track-path=~/rally-tracks/tutorial
____ ____
/ __ \____ _/ / /_ __
/ /_/ / __ `/ / / / / /
/ _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
/____/
Showing details for track [tutorial]:
* Description: Tutorial benchmark for Rally
* Documents: 11,658,903
* Compressed Size: N/A
* Uncompressed Size: 1.4 GB
Schedule:
----------
1. delete-index
2. create-index
3. cluster-health
4. bulk (8 clients)
5. force-merge
6. query-match-all (8 clients)
Congratulations, you have created your first track! You can test it with esrally race --distribution-version=7.14.1 --track-path=~/rally-tracks/tutorial
.
Note
To test the track with Elasticsearch prior to 7.0.0 you need to update index.json
and track.json
as specified in notes above and then execute esrally race --distribution-version=6.5.3 --track-path=~/rally-tracks/tutorial
.
Adding support for test mode#
You can check your track very quickly for syntax errors when you invoke Rally with --test-mode
.
In test mode Rally postprocesses its internal track representation as follows:
Iteration-based tasks run at most one warmup iteration and one measurement iteration.
Time-period-based tasks run at most for 10 seconds without warmup.
In test mode Rally also post-processes all data file names:
If
documents.json
has 1000 documents or fewer, Rally uses it (no modifications).If
documents.json
has more than 1000 documents, Rally assumes an additionaldocuments-1k.json
file is present and uses it. You need to prepare these additional files manually. Pick 1000 documents for every data file in your track and store them in a file with the-1k
suffix. On Linux you can do it as follows:head -n 1000 documents.json > documents-1k.json
.
Challenges#
To specify different workloads in the same track you can use so-called challenges. Instead of specifying the schedule
property on top-level you specify a challenges
array:
{
"version": 2,
"description": "Tutorial benchmark for Rally",
"indices": [
{
"name": "geonames",
"body": "index.json"
}
],
"corpora": [
{
"name": "rally-tutorial",
"documents": [
{
"source-file": "documents.json",
"document-count": 11658903,
"uncompressed-bytes": 1544799789
}
]
}
],
"challenges": [
{
"name": "index-and-query",
"default": true,
"schedule": [
{
"operation": {
"operation-type": "delete-index"
}
},
{
"operation": {
"operation-type": "create-index"
}
},
{
"operation": {
"operation-type": "cluster-health",
"request-params": {
"wait_for_status": "green"
},
"retry-until-success": true
}
},
{
"operation": {
"operation-type": "bulk",
"bulk-size": 5000
},
"warmup-time-period": 120,
"clients": 8
},
{
"operation": {
"operation-type": "force-merge"
}
},
{
"operation": {
"name": "query-match-all",
"operation-type": "search",
"body": {
"query": {
"match_all": {}
}
}
},
"clients": 8,
"warmup-iterations": 1000,
"iterations": 1000,
"target-throughput": 100
}
]
}
]
}
Note
If you define multiple challenges, Rally runs the challenge where default
is set to true
. If you want to run a different challenge, provide the command line option --challenge=YOUR_CHALLENGE_NAME
.
Note
To use the track with Elasticsearch prior to 7.0.0 you need to update index.json
and track.json
with index and mapping types accordingly as specified in notes above.
When should you use challenges? Challenges are useful when you want to run completely different workloads based on the same track but for the majority of cases you should get away without using challenges:
To run only a subset of the tasks, you can use task filtering, e.g.
--include-tasks="create-index,bulk"
will only run these two tasks in the track above or--exclude-tasks="bulk"
will run all tasks except forbulk
.To vary parameters, e.g. the number of clients, you can use track parameters
Structuring your track#
track.json
is the entry point to a track but you can split your track as you see fit. Suppose you want to add more challenges to the track but keep them in separate files. Create a challenges
directory and store the following in challenges/index-and-query.json
:
{
"name": "index-and-query",
"default": true,
"schedule": [
{
"operation": {
"operation-type": "delete-index"
}
},
{
"operation": {
"operation-type": "create-index"
}
},
{
"operation": {
"operation-type": "cluster-health",
"request-params": {
"wait_for_status": "green"
},
"retry-until-success": true
}
},
{
"operation": {
"operation-type": "bulk",
"bulk-size": 5000
},
"warmup-time-period": 120,
"clients": 8
},
{
"operation": {
"operation-type": "force-merge"
}
},
{
"operation": {
"name": "query-match-all",
"operation-type": "search",
"body": {
"query": {
"match_all": {}
}
}
},
"clients": 8,
"warmup-iterations": 1000,
"iterations": 1000,
"target-throughput": 100
}
]
}
Include the new file in track.json
:
{
"version": 2,
"description": "Tutorial benchmark for Rally",
"indices": [
{
"name": "geonames",
"body": "index.json"
}
],
"corpora": [
{
"name": "rally-tutorial",
"documents": [
{
"source-file": "documents.json",
"document-count": 11658903,
"uncompressed-bytes": 1544799789
}
]
}
],
"challenges": [
{% include "challenges/index-and-query.json" %}
]
}
We replaced the challenge content with {% include "challenges/index-and-query.json" %}
which tells Rally to include the challenge from the provided file. You can use include
on arbitrary parts of your track.
To reuse operation definitions across challenges, you can define them in a separate operations
block and refer to them by name in the corresponding challenge:
{
"version": 2,
"description": "Tutorial benchmark for Rally",
"indices": [
{
"name": "geonames",
"body": "index.json"
}
],
"corpora": [
{
"name": "rally-tutorial",
"documents": [
{
"source-file": "documents.json",
"document-count": 11658903,
"uncompressed-bytes": 1544799789
}
]
}
],
"operations": [
{
"name": "delete",
"operation-type": "delete-index"
},
{
"name": "create",
"operation-type": "create-index"
},
{
"name": "wait-for-green",
"operation-type": "cluster-health",
"request-params": {
"wait_for_status": "green"
},
"retry-until-success": true
},
{
"name": "bulk-index",
"operation-type": "bulk",
"bulk-size": 5000
},
{
"name": "force-merge",
"operation-type": "force-merge"
},
{
"name": "query-match-all",
"operation-type": "search",
"body": {
"query": {
"match_all": {}
}
}
}
],
"challenges": [
{% include "challenges/index-and-query.json" %}
]
}
challenges/index-and-query.json
then becomes:
{
"name": "index-and-query",
"default": true,
"schedule": [
{
"operation": "delete"
},
{
"operation": "create"
},
{
"operation": "wait-for-green"
},
{
"operation": "bulk-index",
"warmup-time-period": 120,
"clients": 8
},
{
"operation": "force-merge"
},
{
"operation": "query-match-all",
"clients": 8,
"warmup-iterations": 1000,
"iterations": 1000,
"target-throughput": 100
}
]
}
Note how we reference to the operations by their name (e.g. create
, bulk-index
, force-merge
or query-match-all
).
You can also use Rally’s collect helper to simplify including multiple challenges:
{% import "rally.helpers" as rally %}
{
"version": 2,
"description": "Tutorial benchmark for Rally",
"indices": [
{
"name": "geonames",
"body": "index.json"
}
],
"corpora": [
{
"name": "rally-tutorial",
"documents": [
{
"source-file": "documents.json",
"document-count": 11658903,
"uncompressed-bytes": 1544799789
}
]
}
],
"operations": [
{
"name": "delete",
"operation-type": "delete-index"
},
{
"name": "create",
"operation-type": "create-index"
},
{
"name": "wait-for-green",
"operation-type": "cluster-health",
"request-params": {
"wait_for_status": "green"
},
"retry-until-success": true
},
{
"name": "bulk-index",
"operation-type": "bulk",
"bulk-size": 5000
},
{
"name": "force-merge",
"operation-type": "force-merge"
},
{
"name": "query-match-all",
"operation-type": "search",
"body": {
"query": {
"match_all": {}
}
}
}
],
"challenges": [
{{ rally.collect(parts="challenges/*.json") }}
]
}
The changes are:
We import helper functions from Rally by adding
{% import "rally.helpers" as rally %}
in line 1.We use Rally’s
collect
helper to find and include all JSON files in thechallenges
subdirectory with the statement{{ rally.collect(parts="challenges/*.json") }}
.
Note
Rally’s log file contains the fully rendered track after it has loaded it successfully.
You can even use Jinja2 variables but then you need to import the Rally helpers a bit differently. You also need to declare all variables before the import
statement:
{% set clients = 16 %}
{% import "rally.helpers" as rally with context %}
If you use this idiom you can refer to the clients
variable inside your snippets with {{ clients }}
.