See the detailed analysis: https://docs.google.com/document/d/1efVAMcEw7-R_KatHHcobcFBlNsre-DoThVHI8AO2SDQ/edit?tab=t.0
I ran extensive benchmarks using synthetic data as well as real WAL segments pulled from the prombench runs.
All benchmarks are here: https://github.com/prometheus/prometheus/compare/bwplotka/wal-reuse?expand=1
* optimization(tsdb/wlog): reuse Ref* buffers across WAL watchers' reads
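For illustration, a minimal sketch of the reuse pattern, assuming a watcher that holds its decode buffer across reads (the struct and field names are made up; record.Decoder.Samples appending into the passed slice is the real API):

```go
package wlog

import "github.com/prometheus/prometheus/tsdb/record"

// Sketch only: keep the decoded []RefSample alive across reads and hand
// it back to the decoder, which appends into it, instead of allocating
// a fresh slice for every record.
type watcher struct {
	dec     record.Decoder
	samples []record.RefSample // reused buffer across reads
}

func (w *watcher) decodeSamples(rec []byte) ([]record.RefSample, error) {
	var err error
	// Truncate to length 0 but keep capacity, so the backing array is reused.
	w.samples, err = w.dec.Samples(rec, w.samples[:0])
	return w.samples, err
}
```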
Signed-off-by: bwplotka <bwplotka@gmail.com>
* optimization(tsdb/wlog): avoid expensive error wraps
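The gist of the error-wrap optimization, as a hedged sketch (the record-type check and names are made up):

```go
package wlog

import (
	"errors"
	"fmt"
)

var errUnknownRecord = errors.New("unknown record type")

// Hot path: return a preallocated sentinel error. No allocation and
// no formatting work per failing record.
func validate(recType byte) error {
	if recType > 7 {
		return errUnknownRecord
	}
	return nil
}

// What it replaces: fmt.Errorf allocates and formats on every failure,
// which is measurable when called per WAL record.
func validateSlow(recType byte) error {
	if recType > 7 {
		return fmt.Errorf("validate record: %w", errUnknownRecord)
	}
	return nil
}
```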
Signed-off-by: bwplotka <bwplotka@gmail.com>
* optimization(tsdb/wlog): reuse array for filtering
Signed-off-by: bwplotka <bwplotka@gmail.com>
* fmt
Signed-off-by: bwplotka <bwplotka@gmail.com>
* lint fix
Signed-off-by: bwplotka <bwplotka@gmail.com>
* tsdb/record: add test for clear() on histograms
Signed-off-by: bwplotka <bwplotka@gmail.com>
* updated WriteTo with what's currently expected
Signed-off-by: bwplotka <bwplotka@gmail.com>
---------
Signed-off-by: bwplotka <bwplotka@gmail.com>
Initial implementation of https://github.com/prometheus/prometheus/issues/17790.
Only implements ST-per-sample for Counters. Tests and benchmarks updated.
Note: this increases the size of the RefSample object for all users, whether ST-per-sample is turned on or not.
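For context, a rough sketch of the shape of this change; the ST field name and its placement in RefSample are assumptions, not the exact diff:

```go
package record // sketch; the real type lives in tsdb/record

import "github.com/prometheus/prometheus/tsdb/chunks"

// RefSample widened with a per-sample start timestamp. Adding an int64
// field grows the struct by 8 bytes for every user, whether
// ST-per-sample is enabled or not.
type RefSample struct {
	Ref chunks.HeadSeriesRef // series reference in the head
	ST  int64                // start timestamp (ms); hypothetical field name
	T   int64                // sample timestamp (ms)
	V   float64              // sample value
}
```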
Signed-off-by: Owen Williams <owen.williams@grafana.com>
* simplify timeseries filtering for better readability by using the slices package
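Illustrative sketch of the slices-based filtering; the types are simplified stand-ins for the prompb series handled in storage/remote:

```go
package main

import (
	"fmt"
	"slices"
)

type series struct{ ts int64 }

// dropOlderThan removes series below the lowest allowed timestamp.
// slices.DeleteFunc replaces a hand-rolled copy loop.
func dropOlderThan(s []series, lowest int64) []series {
	return slices.DeleteFunc(s, func(x series) bool {
		return x.ts < lowest
	})
}

func main() {
	s := []series{{ts: 1}, {ts: 5}, {ts: 9}}
	fmt.Println(dropOlderThan(s, 5)) // [{5} {9}]
}
```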
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* ensure that BenchmarkBuildTimeSeries doesn't account for the building of
the actual proto in the benchmark results; we only care about the
buildTimeSeries call
Signed-off-by: Callum Styan <callumstyan@gmail.com>
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Give the test control over time.
The test became flaky after it was made to run in parallel
and "fight" for resources.
Let's hide all of that behind a controllable clock.
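A minimal sketch of what "control over time" can look like, assuming the test injects a clock instead of reading time.Now directly (not the actual patch):

```go
package remote

import (
	"sync"
	"time"
)

// manualClock lets the test advance time deterministically, so timing
// no longer depends on how starved the parallel test runner is.
type manualClock struct {
	mu  sync.Mutex
	now time.Time
}

func (c *manualClock) Now() time.Time {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.now
}

func (c *manualClock) Advance(d time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.now = c.now.Add(d)
}
```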
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Remote Write 1.0 currently attempts to send native histograms with
custom buckets, but these are not actually supported by the RW1 protocol.
Drop, measure, and log them instead.
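A hedged sketch of the drop-measure-log behaviour; the function and counter wiring are assumptions, while histogram.UsesCustomBuckets is the real check:

```go
package remote

import (
	"log/slog"

	"github.com/prometheus/prometheus/model/histogram"
)

// filterRW1Histograms skips native histograms with custom buckets,
// which cannot be represented in the v1 protocol.
func filterRW1Histograms(hs []*histogram.Histogram, logger *slog.Logger) []*histogram.Histogram {
	kept := hs[:0]
	dropped := 0
	for _, h := range hs {
		if h.UsesCustomBuckets() {
			dropped++
			continue
		}
		kept = append(kept, h)
	}
	if dropped > 0 {
		// The real code also increments a counter metric here ("measure").
		logger.Warn("dropped custom-bucket native histograms unsupported by RW1", "count", dropped)
	}
	return kept
}
```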
Fixes: #17140
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Because of relabelling, an endpoint may only select a subset of the series
that go through WriteStorage.
Having a highestTimestamp at the WriteStorage level therefore yields wrong values
if the corresponding sample never even makes it to a remote queue.
Currently PrometheusRemoteWriteBehind is based on that value, and would fire
if an endpoint is only interested in a subset of series that take time
to appear.
A "prometheus_remote_storage_queue_highest_timestamp_seconds" metric that only
takes into account samples in the queue is introduced, and is used in
PrometheusRemoteWriteBehind and in the dashboards in documentation/prometheus-mixin.
The same applies to samplesIn/dataIn: QueueManager should know better
when to update those, namely when data is enqueued.
That makes dataDropped unnecessary, which helps simplify the logic
in QueueManager.calculateDesiredShards().
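Sketch of how the new metric can be driven at enqueue time (wiring simplified; the real implementation tracks a monotonic maximum per queue):

```go
package remote

import "github.com/prometheus/client_golang/prometheus"

// Only advanced when a sample is actually enqueued, so series filtered
// out by relabelling never move it.
var queueHighestTimestamp = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "prometheus_remote_storage_queue_highest_timestamp_seconds",
	Help: "Highest timestamp of samples that were enqueued for this remote queue.",
})

func observeEnqueued(tsMillis int64) {
	// Sketch: the real code only sets this when tsMillis exceeds the
	// current maximum, and updates samplesIn/dataIn at the same point.
	queueHighestTimestamp.Set(float64(tsMillis) / 1000)
}
```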
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
- The tool left an empty line behind that we don't need anymore, see
https://github.com/prometheus/prometheus/pull/17092. (Arguably not a
bug in the tool but just our stricter style about empty lines.)
- In tsdb/index/postings_test.go, our (admittedly somewhat
convoluted) code structure tricked the tool, so it spat out something
that wouldn't even compile.
- storage/remote/queue_manager_test.go is just a minor formatting
nit.
Signed-off-by: beorn7 <beorn@grafana.com>
See
https://pkg.go.dev/golang.org/x/tools/gopls/internal/analysis/modernize
for details.
This ran into a few issues (arguably bugs in the modernize tool),
which I will fix in the next commit, so that we have transparency
about what was done automatically.
Beyond those hiccups, I believe all the changes applied are
legitimate. Even where there might be no tangible direct gain, I would
argue it's still better to use the "modern" way to avoid
micro-discussions in tiny style PRs later.
Signed-off-by: beorn7 <beorn@grafana.com>
Add a metric to track unexpected metadata seen in populateV2TimeSeries, which would indicate metadata being incorrectly routed in queue_manager code paths.
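A sketch of such a counter; the metric name and help text here are hypothetical:

```go
package remote

import "github.com/prometheus/client_golang/prometheus"

// Incremented when populateV2TimeSeries encounters metadata it did not
// expect, pointing at a mis-routed record in queue_manager code paths.
var unexpectedMetadata = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "prometheus",
	Subsystem: "remote_storage",
	Name:      "unexpected_metadata_total", // hypothetical name for this sketch
	Help:      "Total number of metadata records unexpectedly seen while populating v2 time series.",
})
```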
---------
Signed-off-by: leegin <leegin.t@gmail.com>
Signed-off-by: Darkknight <leegin.t@gmail.com>
A race condition in TestSendSamplesWithBackoffWithSampleAgeLimit was
observed in CI where the sample age limit was too close to the backoff
time, causing samples to be dropped intermittently. Increasing the
SampleAgeLimit resolves the problem.
Signed-off-by: Adam Bernot <bernot@google.com>
As mentioned in #16182, the BenchmarkStartup test for the queue manager
covers an old API and uses settings that will not occur in production.
Signed-off-by: Adam Bernot <bernot@google.com>
- Wrapped existing test logic in a loop to run with both protocol versions
- Ensures consistent behavior across protocol versions for dropping old time series
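Shape of the change as a simplified sketch (the test name, version strings, and subtest wiring are placeholders):

```go
package remote

import (
	"fmt"
	"testing"
)

// The existing assertions are wrapped in one subtest per remote-write
// protocol version so both code paths get exercised.
func TestDropOldTimeSeries(t *testing.T) {
	for _, protoVersion := range []string{"1.0", "2.0"} {
		t.Run(fmt.Sprintf("proto-%s", protoVersion), func(t *testing.T) {
			// ... original test body, parameterised by protoVersion ...
		})
	}
}
```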
Signed-off-by: AxcelXander <tyz666@bu.edu>
Co-authored-by: AxcelXander <tyz666@bu.edu>
Rationales:
* metadata-wal-records might be deprecated and replaced going forward: https://github.com/prometheus/prometheus/issues/15911
* PRW 2.0 works without metadata just fine (although it then sends metrics as untyped, as expected).
Signed-off-by: bwplotka <bwplotka@gmail.com>
There was a bug (due to confusion?) in the local metadata cache, which is keyed
by metric family, not by the series metric name. The fix is to NOT use that local cache
at all (it's still needed for the current metadata API implementation; added a TODO
on how we can get rid of it).
I went ahead and also renamed the Metric field in metadata structs to MetricFamily to make
it clear that it's not always __name__.
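To see why family != __name__, a tiny standalone illustration (metric names are made up):

```go
package main

import "fmt"

// Classic histograms expose several series per metric family, so caching
// metadata by family and looking it up by series name misses for the
// suffixed series.
func main() {
	family := "http_request_duration_seconds"
	for _, suffix := range []string{"_bucket", "_sum", "_count"} {
		fmt.Printf("__name__=%q belongs to family %q\n", family+suffix, family)
	}
}
```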
Signed-off-by: bwplotka <bwplotka@gmail.com>
It was crashing due to uninitialized metrics, and failing to terminate because it
was reading segment names incorrectly.
We need to export `SetMetrics` to avoid the first problem.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>