Cleaning old measures in Gnocchi on waitingforcode.com

Versions: Gnocchi 4.2

What sets Gnocchi apart is the precomputation of the measures. It doesn't allow ad-hoc queries but, on the other hand, provides pretty good read performance. However, as new time series points arrive, the old ones aren't kept forever.

This post focuses on data cleaning in Gnocchi. The first section presents MetricJanitor, a daemon responsible for removing old and unused measures. The second part talks about the data removal that happens when new measures are handled. The final section contains some examples illustrating the concepts presented in the two previous parts.

MetricJanitor

The role of MetricJanitor is to delete everything that is no longer used by Gnocchi. This daemon is executed at a regular interval specified in the metric_cleanup_delay option. By default its value is 5 minutes, which means that every 5 minutes the daemon checks whether there are any deleted metrics. A metric is considered deleted when its status in the indexer is set to deleted.
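As a rough illustration, the delay could be overridden in gnocchi.conf. The sketch below assumes that the option lives in the [metricd] section and is expressed in seconds (300 corresponding to the 5 minutes default). It only parses an example fragment with Python's standard library to show the resulting value; it doesn't use Gnocchi's own configuration loader, and read_cleanup_delay is a hypothetical helper.

```python
import configparser

# Example gnocchi.conf fragment; the [metricd] section name is an assumption.
GNOCCHI_CONF = """
[metricd]
# run the cleanup every 60 seconds instead of the default 300
metric_cleanup_delay = 60
"""

DEFAULT_CLEANUP_DELAY = 300  # seconds, i.e. the 5 minutes mentioned above


def read_cleanup_delay(conf_text):
    """Return the cleanup delay in seconds, or the default when it's not set."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getint("metricd", "metric_cleanup_delay",
                         fallback=DEFAULT_CLEANUP_DELAY)


cleanup_delay = read_cleanup_delay(GNOCCHI_CONF)
```

A shorter delay makes deleted metrics disappear from the storage sooner, at the price of more frequent checks against the indexer.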

Once the deleted metrics are retrieved, MetricJanitor removes all data related to them: unprocessed measures, aggregated measures and the metric itself from the index storage. Unlike the MetricProcessor process, MetricJanitor is much simpler since its activity can be summarized in the previous 3 points, which are mainly simple delete operations.
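The 3 removal steps can be pictured with a toy, in-memory model. This is only a sketch under simplifying assumptions: ToyBackend, mark_deleted and janitor_pass are hypothetical names and plain dictionaries stand in for the indexer and the storages, so none of this is Gnocchi's real code.

```python
class ToyBackend:
    """In-memory stand-in for the indexer, incoming and aggregate storages."""

    def __init__(self):
        self.index = {}        # metric id -> status ("active" or "deleted")
        self.unprocessed = {}  # metric id -> raw measures waiting for processing
        self.aggregates = {}   # metric id -> precomputed aggregates

    def add_metric(self, metric_id, raw_measures, aggregates):
        self.index[metric_id] = "active"
        self.unprocessed[metric_id] = list(raw_measures)
        self.aggregates[metric_id] = dict(aggregates)

    def mark_deleted(self, metric_id):
        # Only flags the metric; the data stays until the next janitor pass.
        self.index[metric_id] = "deleted"


def janitor_pass(backend):
    """One cleanup run: remove everything related to metrics flagged as deleted."""
    deleted = [metric_id for metric_id, status in backend.index.items()
               if status == "deleted"]
    for metric_id in deleted:
        backend.unprocessed.pop(metric_id, None)  # 1. unprocessed measures
        backend.aggregates.pop(metric_id, None)   # 2. aggregated measures
        del backend.index[metric_id]              # 3. the metric itself
    return deleted


backend = ToyBackend()
backend.add_metric("cpu", [1.0, 2.0], {"sum": 3.0})
backend.add_metric("memory", [5.0], {"sum": 5.0})
backend.mark_deleted("memory")
removed = janitor_pass(backend)
```

After the pass only the "cpu" metric is left in the 3 dictionaries, while "memory" is gone from all of them.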

Metrics cleaning at data handling

But MetricJanitor is not the only cleaning-oriented component in Gnocchi. Old data is also removed when new measures are handled, more concretely in StorageDriver's _add_measures(aggregation, ap_def, metric, grouped_serie, previous_oldest_mutable_timestamp, oldest_mutable_timestamp) method.

Existing aggregates are cleaned under 2 conditions: either the storage driver isn't in WRITE_FULL mode and we already have some aggregated measures, or we already have the aggregates and the associated archive policy defines a timespan attribute. The Examples section shows how the aggregates are truncated in the latter case.

Measures cleaning at this stage consists of reading all split keys registered for the given metric and removing the ones that are before the first measure to keep. The keys that are not removed, together with the points associated with them, are rewritten after the cleaning because of the compression.
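The idea can be sketched with plain integers standing for timestamps in seconds. The snippet is a simplification under stated assumptions rather than Gnocchi's implementation: truncate_points, split_key and keys_to_remove are hypothetical helpers, a split is modeled as a fixed-size window of seconds, and a point is considered too old when it isn't strictly newer than the last timestamp minus the timespan.

```python
def truncate_points(timestamps, timespan):
    """Keep only the points newer than (last timestamp - timespan)."""
    if not timestamps:
        return []
    oldest_allowed = max(timestamps) - timespan
    return [ts for ts in sorted(timestamps) if ts > oldest_allowed]


def split_key(timestamp, split_span):
    """Start of the split (chunk of points) holding the given timestamp."""
    return timestamp - (timestamp % split_span)


def keys_to_remove(existing_keys, first_point_to_keep, split_span):
    """Split keys located entirely before the first point to keep."""
    oldest_key_to_keep = split_key(first_point_to_keep, split_span)
    return sorted(key for key in existing_keys if key < oldest_key_to_keep)


# One point per second from t=0 to t=17, a 10-second timespan
# and splits covering 4 seconds each.
points = list(range(0, 18))
kept_points = truncate_points(points, timespan=10)
existing_keys = {split_key(ts, 4) for ts in points}
removed_keys = keys_to_remove(existing_keys, kept_points[0], split_span=4)
```

With these values the points from t=8 to t=17 are kept, so the splits starting at 0 and 4 are dropped as a whole while the splits starting at 8, 12 and 16 remain and would be rewritten.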

Examples

As promised in the previous part, let's see what happens when new measures are added and our metric's archive policy has a timespan attribute defined:

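# standard library and NumPy imports used by the test; gnocchi's incoming and storage
# modules, as well as the Conf, InMemoryIndexer and datetime64 helpers, are assumed
# to be imported or defined elsewhere in the test module
import errno
import os
import shutil
import time

import numpy

def should_remove_too_old_measures_when_new_measures_are_added(self):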
  def create_dirs(dirs_to_create):
    for dir_to_create in dirs_to_create:
      shutil.rmtree(dir_to_create, True)
      try:
        os.makedirs(dir_to_create)
      except OSError as exc:
        # an already existing directory is fine, any other error is re-raised
        if not (exc.errno == errno.EEXIST and os.path.isdir(dir_to_create)):
            raise
  dirs_to_create = ['/tmp/incoming30-4/',
                    '/tmp/incoming30-4/00000000-0000-0000-0000-000000000004',
                    '/tmp/00000000-0000-0000-0000-000000000004/',
                    '/tmp/00000000-0000-0000-0000-000000000004/agg_sum'
                    ]
  create_dirs(dirs_to_create)

  in_memory_storage = incoming.file.FileStorage(Conf())
  in_memory_storage.basepath_tmp = '/tmp/'
  indexer = InMemoryIndexer()
  first_metric = indexer.metrics[0]
  # add some measures and process them for the first time
  in_memory_storage.add_measures(first_metric.id, [
      [
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 0), 1.0),
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 1), 3.0),
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 2), 5.0),
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 3), 7.0),
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 4), 9.0),
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 5), 11.0),
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 6), 13.0),
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 7), 15.0),
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 8), 17.0),
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 9), 19.0)
      ]
  ])
  time.sleep(5)

  storage_driver = storage.file.FileStorage(Conf())
  storage_driver.basepath_tmp = '/tmp/'
  storage_driver.process_new_measures(indexer, in_memory_storage, first_metric.id, True)

  measures_first_generation = storage_driver.find_measure(first_metric, lambda item: True,
                                                          numpy.timedelta64(1, 's'), 'sum')

  # add new measures and reprocess them
  # the oldest measures should be deleted because of the archive policy
  # only 8 measures are added here, so 2 should be kept from the previous generation
  in_memory_storage.add_measures(first_metric.id, [
      [
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 10), 21.0),
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 11), 23.0),
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 12), 25.0),
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 13), 27.0),
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 14), 29.0),
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 15), 31.0),
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 16), 33.0),
        incoming.Measure(datetime64(2018, 5, 1, 10, 0, 17), 35.0)
      ]
  ])
  time.sleep(5)

  storage_driver.process_new_measures(indexer, in_memory_storage, first_metric.id, True)

  measures_second_generation = storage_driver.find_measure(first_metric, lambda item: True,
                                                            numpy.timedelta64(1, 's'), 'sum')

  stringified_measures_1st_generation = [str(measure_time[0])
                                          for measure_time in measures_first_generation]
  expected_dates_1st_generation = ['2018-05-01T10:00:00.000000000', '2018-05-01T10:00:01.000000000',
    '2018-05-01T10:00:02.000000000', '2018-05-01T10:00:03.000000000',
    '2018-05-01T10:00:04.000000000', '2018-05-01T10:00:05.000000000',
    '2018-05-01T10:00:06.000000000', '2018-05-01T10:00:07.000000000',
    '2018-05-01T10:00:08.000000000', '2018-05-01T10:00:09.000000000']
  for generation_date in expected_dates_1st_generation:
      self.assertIn(generation_date, stringified_measures_1st_generation)
  stringified_measures_2nd_generation = [str(measure_time[0])
                                          for measure_time in measures_second_generation]
  expected_dates_2nd_generation = ['2018-05-01T10:00:08.000000000', '2018-05-01T10:00:09.000000000',
    '2018-05-01T10:00:10.000000000', '2018-05-01T10:00:11.000000000',
    '2018-05-01T10:00:12.000000000', '2018-05-01T10:00:13.000000000',
    '2018-05-01T10:00:14.000000000', '2018-05-01T10:00:15.000000000',
    '2018-05-01T10:00:16.000000000', '2018-05-01T10:00:17.000000000']
  for generation_date in expected_dates_2nd_generation:
      self.assertIn(generation_date, stringified_measures_2nd_generation)

As you can see, Gnocchi keeps only the points that fit within the defined timespan. Let's now see how MetricJanitor behaves. To do that, we'll run Gnocchi's Docker image and perform the following operations:

# in gnocchi-docker/gnocchi/run-gnocchi.sh add this line
# it enables the debug logging level, required to see the activity of MetricJanitor
# printf "[DEFAULT]\ndebug = true\n" >> /etc/gnocchi/gnocchi.conf

# add a metric
curl -d '{
  "archive_policy_name": "high",
  "unit": "%",
  "name": "memory use"
}' -H "Content-Type: application/json"  -X POST http://admin:admin@localhost:8041/v1/metric

# note its id somewhere
# now add the measures
curl -d '[
  {
    "timestamp": "2018-04-26T16:05:00",
    "value": 43.1
  },
  {
    "timestamp": "2018-04-26T16:06:00",
    "value": 50.5
  },
  {
    "timestamp": "2018-04-26T16:07:00",
    "value": 39.6
  }
]' -H "Content-Type: application/json"  -X POST http://admin:admin@localhost:8041/v1/metric/fef512e7-0e35-4dc1-85d0-2ba02cf0970e/measures

# And finally delete the metric soon after the measures are processed
curl  -H "Content-Type: application/json"  -X DELETE http://admin:admin@localhost:8041/v1/metric/fef512e7-0e35-4dc1-85d0-2ba02cf0970e

After executing the above operations, the logs should contain entries proving the activity of MetricJanitor:

gnocchi-metricd_1  | 2018-05-10 17:14:08,709 [35] DEBUG    gnocchi.storage: Deleting metric fef512e7-0e35-4dc1-85d0-2ba02cf0970e
gnocchi-metricd_1  | 2018-05-10 17:14:08,728 [35] DEBUG    gnocchi.cli.metricd: Metrics marked for deletion removed from backend

As shown in this post, Gnocchi precomputes aggregation results but it doesn't keep all aggregates from the beginning of the metric. This property remains configurable and, as described in the second section, Gnocchi removes the aggregation points older than the configured timespan. Moreover, Gnocchi ensures that all data related to a removed metric is also removed. It does so with an asynchronous daemon called MetricJanitor, executed every 5 minutes by default and operating on removed metrics.