OpenSearch integration improvements #139
filipecosta90 wants to merge 22 commits into qdrant:master from
…ports it (supported for Lucene in OpenSearch version 2.13 and later)
… for anonymous_auth feature of opensearch)
LukasWestholt left a comment
Thank you for your commits. I want to compare the search and upload performance of the old and new implementations, and I hope this will get merged soon.
@@ -0,0 +1,87 @@
[
  {
    "name": "opensearch-default",
You need to rename the single-shard experiments from
"opensearch-" to "opensearch-single-shard-" so both are accessible.
According to this, number_of_shards=1 is already the default. Do those experiments then even do anything different from opensearch-single-node-default-index.json?
Should we maybe set number_of_replicas=0 for optimization? See https://repost.aws/knowledge-center/opensearch-indexing-performance
Ah, I see: number_of_replicas=0 is already set.
if "number_of_shards" in index_config:
    index_settings["number_of_shards"] = 1
"Tuples don't support item assignment"
I suggest:
index_settings = {
    "knn": True,
    "number_of_replicas": 0,
    "refresh_interval": -1,  # no refresh is required because we index all the data at once
}
index_config = collection_params.get("index")
# if we specify the number_of_shards on the config, enforce it. otherwise use the default
if "number_of_shards" in index_config:
    index_settings["number_of_shards"] = 1
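For context on the quoted error, here is a minimal sketch of one way it can arise. It assumes the cause is a stray trailing comma after the dict literal, which silently turns the value into a one-element tuple; the PR does not confirm that this was the cause. The `collection_params` value and the `or {}` guard are hypothetical additions for illustration.

```python
# Assumed cause (not confirmed by the PR): a trailing comma after the dict
# literal makes Python build a 1-tuple containing the dict.
broken_settings = {
    "knn": True,
    "number_of_replicas": 0,
},  # <- this comma wraps the dict in a tuple

try:
    broken_settings["number_of_shards"] = 1
except TypeError as exc:
    error_message = str(exc)
    print(f"TypeError: {error_message}")

# Without the trailing comma, index_settings is a plain dict and can be updated.
index_settings = {
    "knn": True,
    "number_of_replicas": 0,
    "refresh_interval": -1,
}

# Hypothetical config for illustration; `or {}` guards against a missing "index" key.
collection_params = {"index": {"number_of_shards": 1}}
index_config = collection_params.get("index") or {}
if "number_of_shards" in index_config:
    index_settings["number_of_shards"] = 1

print(index_settings)
```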
No real performance difference between the old and new implementations.
--- ../results/opensearch-old/output.json 2024-10-02 15:05:57.197185400 +0200
+++ ../results/opensearch-new/output.json 2024-10-02 15:52:09.001055200 +0200
@@ -1,59 +1,59 @@
[
{
"engine_name": "opensearch",
"setup_name": "opensearch-m-16-ef-128",
"dataset_name": "glove-100-angular",
- "upload_time": 505.1072022999997,
- "total_upload_time": 506.65230710000014,
+ "upload_time": 571.6861197999997,
+ "total_upload_time": 571.9920295000002,
"parallel": 1,
"engine_params": {
"knn.algo_param.ef_search": 128
},
- "mean_time": 0.08039978975999684,
- "mean_precisions": 0.819351,
- "std_time": 0.07328582483094778,
- "min_time": 0.021495099999810918,
- "max_time": 4.868069999999534,
- "rps": 12.39949293721894,
- "p95_time": 0.10517231999988325,
- "p99_time": 0.18999490199986213
+ "mean_time": 0.07982564825000299,
+ "mean_precisions": 0.8167789999999999,
+ "std_time": 0.07186505068791486,
+ "min_time": 0.008940600000642007,
+ "max_time": 6.208381400001599,
+ "rps": 12.489320049897197,
+ "p95_time": 0.10064215000093098,
+ "p99_time": 0.21390265599933658
},
{
"engine_name": "opensearch",
"setup_name": "opensearch-m-16-ef-128",
"dataset_name": "glove-100-angular",
- "upload_time": 505.1072022999997,
- "total_upload_time": 506.65230710000014,
+ "upload_time": 571.6861197999997,
+ "total_upload_time": 571.9920295000002,
"parallel": 1,
"engine_params": {
"knn.algo_param.ef_search": 256
},
- "mean_time": 0.08500446872999737,
- "mean_precisions": 0.816762,
- "std_time": 0.07344455333790761,
- "min_time": 0.026552799999990384,
- "max_time": 4.793960500000139,
- "rps": 11.71986972688328,
- "p95_time": 0.1365703799996481,
- "p99_time": 0.28884295299986684
+ "mean_time": 0.08107645404999203,
+ "mean_precisions": 0.814407,
+ "std_time": 0.020966374585830613,
+ "min_time": 0.028347600000415696,
+ "max_time": 1.7499939999997878,
+ "rps": 12.30329778171997,
+ "p95_time": 0.09037325000108466,
+ "p99_time": 0.09874180600041654
},
{
"engine_name": "opensearch",
"setup_name": "opensearch-m-16-ef-128",
"dataset_name": "glove-100-angular",
- "upload_time": 505.1072022999997,
- "total_upload_time": 506.65230710000014,
+ "upload_time": 571.6861197999997,
+ "total_upload_time": 571.9920295000002,
"parallel": 1,
"engine_params": {
"knn.algo_param.ef_search": 512
},
- "mean_time": 0.07950227081000166,
- "mean_precisions": 0.816762,
- "std_time": 0.027110040761532377,
- "min_time": 0.027317600000060338,
- "max_time": 1.4170608000003995,
- "rps": 12.539646869542393,
- "p95_time": 0.09963219500077684,
- "p99_time": 0.14407165300047384
+ "mean_time": 0.07992414966999531,
+ "mean_precisions": 0.814407,
+ "std_time": 0.010356757441751704,
+ "min_time": 0.025327800000013667,
+ "max_time": 0.42032800000015413,
+ "rps": 12.479517153224734,
+ "p95_time": 0.08913765000097555,
+ "p99_time": 0.09884917200071869
}
]
\ No newline at end of file
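To put the diff above into relative terms, the following sketch computes the percentage change for a few fields of the first entry (ef_search = 128). The numbers are copied from the diff; the helper name `relative_change` is made up for this example. Each side is a single run, so small differences may well be noise.

```python
def relative_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


# Values copied from the first entry (ef_search = 128) of the diff.
old_run = {
    "upload_time": 505.1072022999997,
    "rps": 12.39949293721894,
    "mean_precisions": 0.819351,
}
new_run = {
    "upload_time": 571.6861197999997,
    "rps": 12.489320049897197,
    "mean_precisions": 0.8167789999999999,
}

changes = {key: relative_change(old_run[key], new_run[key]) for key in old_run}
for key, pct in changes.items():
    print(f"{key}: {pct:+.2f}%")
```

The same helper can be applied to the ef_search = 256 and 512 entries to compare tail latencies such as p95_time and p99_time.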
# Update the index settings back to the default
refresh_interval = "1s"
cls.client.indices.put_settings(
    index=OPENSEARCH_INDEX,
    params={
        "timeout": 300,
    },
    body={"index": {"refresh_interval": refresh_interval}},
Is cls.client.indices.refresh(index=OPENSEARCH_INDEX) better?
I believe it's best as is, meaning:
- we disable refresh during indexing
- we re-enable it afterwards
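The two steps above can be sketched as a context manager. This is an illustration under assumptions, not the PR's code: `refresh_disabled` and the recording stub client are hypothetical names, and the stub stands in for an opensearch-py client so the example runs without a cluster. The `put_settings` and `refresh` calls mirror the ones quoted in this thread. A one-off `indices.refresh` makes already-indexed documents searchable once, whereas restoring `refresh_interval` turns periodic refresh back on.

```python
from contextlib import contextmanager


@contextmanager
def refresh_disabled(client, index, restore_interval="1s", timeout=300):
    """Disable periodic refresh while the body runs, then restore it."""
    client.indices.put_settings(
        index=index,
        params={"timeout": timeout},
        body={"index": {"refresh_interval": "-1"}},
    )
    try:
        yield
    finally:
        # Restore periodic refresh even if the upload fails.
        client.indices.put_settings(
            index=index,
            params={"timeout": timeout},
            body={"index": {"refresh_interval": restore_interval}},
        )


class _RecordingIndices:
    """Hypothetical stub that records calls instead of contacting a cluster."""

    def __init__(self):
        self.calls = []

    def put_settings(self, **kwargs):
        self.calls.append(("put_settings", kwargs["body"]))

    def refresh(self, **kwargs):
        self.calls.append(("refresh", kwargs["index"]))


class _RecordingClient:
    def __init__(self):
        self.indices = _RecordingIndices()


client = _RecordingClient()
with refresh_disabled(client, "bench"):
    pass  # the bulk upload would go here

# Optional one-off refresh; it does not replace restoring the interval.
client.indices.refresh(index="bench")

for call in client.indices.calls:
    print(call)
```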
Please check this PR: #214 for OpenSearch improvements and let me know if anything else is needed. I am one of the maintainers of OpenSearch and happy to contribute to the OpenSearch improvements in this tool.
The changes below aim to: