Prometheus Metrics¶
Nautobot supports optionally exposing native Prometheus metrics from the application. Prometheus is a popular time series metric platform used for monitoring.
Configuration¶
Metrics are not exposed by default. Metric exposition can be toggled with the METRICS_ENABLED configuration setting which exposes metrics at the /metrics HTTP endpoint, e.g. https://nautobot.local/metrics.
In addition to the METRICS_ENABLED setting, database and/or caching metrics can also be enabled by changing the database engine and/or caching backends from django.db.backends / django_redis.cache to django_prometheus.db.backends / django_prometheus.cache.backends.redis:
DATABASES = {
"default": {
# Other settings...
"ENGINE": "django_prometheus.db.backends.postgresql", # use "django_prometheus.db.backends.mysql" with MySQL
}
}
# Other settings...
CACHES = {
"default": {
# Other settings...
"BACKEND": "django_prometheus.cache.backends.redis.RedisCache",
}
}
Added in version 2.2.1
In case the /metrics endpoint is not performant or not required, you can disable specific apps with the METRICS_DISABLED_APPS configuration setting.
For more information see the django-prometheus docs.
Authentication¶
Added in version 2.1.5
Metrics by default do not require authentication to view. Authentication can be toggled with the METRICS_AUTHENTICATION configuration setting. If set to True, this will require the user to be logged in or to use an API token. See REST API Authentication for more details on API authentication.
Sample Telegraf configuration¶
[[inputs.prometheus]]
urls = ["http://localhost/metrics"]
metric_version=2
http_headers = {"Authorization" = "Token 0123456789abcdef0123456789abcdef01234567"}
Metric Types¶
Nautobot makes use of the django-prometheus library to export a number of different types of metrics, including:
- Per model insert, update, and delete counters
- Per view request counters
- Per view request latency histograms
- Request body size histograms
- Response body size histograms
- Response code counters
- Database connection, execution, and error counters
- Cache hit, miss, and invalidation counters
- Django middleware latency histograms
- Other Django related metadata metrics
Additionally, there are a number of metrics custom to Nautobot specifically:
| Name | Description | Type | Exposed By |
|---|---|---|---|
health_check_database_info |
Result of the last database health check | Gauge | Web Server |
health_check_redis_backend_info |
Result of the last redis health check | Gauge | Web Server |
nautobot_app_metrics_processing_ms |
The time it took to collect custom app metrics from all installed apps | Gauge | Web Server |
nautobot_worker_started_jobs |
The amount of jobs that were started | Counter | Worker |
nautobot_worker_finished_jobs |
The amount of jobs that were finished (incl. status label) | Counter | Worker |
nautobot_worker_exception_jobs |
The amount of jobs that ran into an exception (incl. exception type label) | Counter | Worker |
nautobot_worker_singleton_conflict |
The amount of jobs that encountered a closed singleton lock | Counter | Worker |
Note
Due to the multitude of possible deployment scenarios (web server and worker co-hosted on the same machine or not, different possible entrypoint commands for both contexts) some of the metrics exposed for specific components may also be present on the other component. It is up to the operator to account for this when working with the resulting metrics.
These for example give you the option to identify the individual failure/exception rates of specific jobs. Note that all of these metrics are per instance. Thus, you need to do perform aggregations in your visualizations in order to get a complete picture if you are using multiple web servers and/or workers.
For the exhaustive list of exposed metrics, visit the /metrics endpoint on your Nautobot instance. For further information about the different metrics types, see the relevant Prometheus documentation.
Multi Processing Notes¶
When deploying Nautobot in a multi-process manner (e.g. running multiple uWSGI workers) the Prometheus client library requires the use of a shared directory to collect metrics from all worker processes. To configure this, first create or designate a local directory to which the worker processes have read and write access, and then configure your WSGI service (e.g. uWSGI) to define this path as the prometheus_multiproc_dir environment variable.
Since the files stored in the designated directory are not meant to be long-lived, it is recommended to use a temporary directory such as /tmp/nautobot_prometheus or an emptyDir in Kubernetes environments for this purpose. Additionally, in order to avoid scraping delays induced by the processing of orphaned files, this directory must be wiped on a regular basis. In order to avoid removal of files that are still in use, it is recommended to do this before the uWSGI process starts.
Note: the below code snippets are meant to be examples of how to perform the necessary cleanup. The exact implementation may vary based on your specific deployment and operational needs.
Indicatively, you could use the hook-accepting1 uWSGI hook to perform this:
; uwsgi.ini
; Before first worker starts accept request
hook-accepting1 = exec:bash -c 'if [[ $prometheus_multiproc_dir ]]; then rm $prometheus_multiproc_dir/*.db; else echo "No prometheus multi_proc_dir"; fi'
For environments where it's not enough to rely on cleanups based on worker restarts, a more fitting approach is to clean up in a periodic manner, while uWSGI is running. You can use a cron job or similar scheduled task to clean up orphan files, for example:
-
Create a Python script that scans the multiproc directory and removes files belonging to PIDs that are no longer running.
import os import re import shutil import time import uwsgi from prometheus_client import multiprocess # Minimum age of files to consider for cleanup (e.g., 1 hour) MIN_AGE_SECONDS = 3600 def cleanup_orphaned_prom_metric_files(metrics_dir): """ Scans the multiproc directory and removes files belonging to PIDs that are no longer running. """ if not os.path.exists(metrics_dir): return # Pattern to find PIDs in filenames (e.g., gauge_multiproc_123.db) pid_pattern = re.compile(r'.+_(\d+)\.db$') # Get list of currently running PIDs active_pids = set() for pid in os.listdir('/proc'): if pid.isdigit(): active_pids.add(int(pid)) for filename in os.listdir(metrics_dir): match = pid_pattern.match(filename) if match: try: file_pid = int(match.group(1)) except ValueError: continue # If the PID from the file is not in the active PID list # Only consider files older than 1 hour to avoid race conditions file_path = os.path.join(metrics_dir, filename) file_mtime = os.path.getmtime(file_path) file_age_seconds = time.time() - file_mtime if (file_pid not in active_pids) and (file_age_seconds > MIN_AGE_SECONDS): try: # 1. Tell the client to "forget" the process multiprocess.mark_process_dead(file_pid) # 2. Delete the physical file, ignore if it was already removed with suppress(FileNotFoundError): os.remove(file_path) print(f"Cleaned up orphaned metric file: {filename}") except OSError as e: print(f"Error deleting {filename}: {e}") # Schedule this script to run at regular intervals using uWSGI's `timer` feature. def cleanup_timer(signum): cleanup_orphaned_prom_metric_files(os.getenv('prometheus_multiproc_dir')) # Register only on the first worker to avoid multiple workers trying to clean up at the same time if uwsgi.worker_id() == 0: uwsgi.register_signal(99, "", cleanup_timer) uwsgi.add_timer(99, 3600) # this is 1 hour in seconds -
Copy the file to a specific path (eg.
/opt/nautobot/media/prometheus_cleanup.py) and import it from uwsgi.ini file.
The implementation described above is a mere example. The same functionality can be achieved with a different approach, for example by using a separate script that is executed by a cron job or similar scheduled task instead of using uWSGI's timer feature. Another interesting approach would be to introduce uWSGI mules (documentation) to avoid interrupting the main uwsgi process. The important part is to ensure that the cleanup process is running at regular intervals to prevent the accumulation of orphaned metric files.
Relevant documentation: