Metadata-Version: 2.1
Name: colmet
Version: 0.6.8
Summary: A utility to monitor job resources in an HPC environment, especially OAR
Home-page: http://oar.imag.fr/
Author: Philippe Le Brouster, Olivier Richard
Author-email: philippe.le-brouster@imag.fr, olivier.richard@imag.fr
Maintainer: Salem Harrache
Maintainer-email: salem.harrache@inria.fr
License: GNU GPL
Keywords: monitoring,taskstat,oar,hpc,sciences
Platform: Linux
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: System :: Monitoring
Classifier: Topic :: System :: Clustering
Classifier: Programming Language :: Python :: 3.5
Description-Content-Type: text/markdown
License-File: LICENSE

# Colmet - Collecting metrics about jobs running in a distributed environment

## Introduction:

Colmet is a monitoring tool that collects metrics about jobs running in a
distributed environment, especially on clusters and grids. It currently
provides several backends:
- Input backends:
  - taskstats: fetch task metrics from the linux kernel
  - rapl: intel processors realtime consumption metrics
  - perfhw: perf_event counters
  - jobproc: get infos from /proc
  - ipmipower: get power metrics from ipmi
  - temperature: get temperatures from /sys/class/thermal
  - infiniband: get infiniband/omnipath network metrics
  - lustre: get lustre FS stats
- Output backends:
  - elasticsearch: store the metrics on elasticsearch indexes
  - hdf5: store the metrics on the filesystem
  - stdout: display the metrics on the terminal

It uses zeromq to transport the metrics across the network.

It is currently bound to the [OAR](http://oar.imag.fr) RJMS.

A Grafana [sample dashboard](./graph/grafana) is provided for the elasticsearch backend. Here are some snapshots:

![](./screenshot1.png)

![](./screenshot2.png)

## Installation:

### Requirements

- a Linux kernel that supports
  - Taskstats
  - intel_rapl (for RAPL backend)
  - perf_event (for perfhw backend)
  - ipmi_devintf (for ipmi backend)

- Python Version 2.7 or newer
  - python-zmq 2.2.0 or newer
  - python-tables 3.3.0 or newer
  - python-pyinotify 0.9.3-2 or newer
  - python-requests

- For the Elasticsearch output backend (recommended for sites with > 50 nodes)
  - An Elasticsearch server
  - A Grafana server (for visualization)

- For the RAPL input backend:
  - libpowercap, powercap-utils (https://github.com/powercap/powercap)

- For the infiniband backend:
  - `perfquery` command line tool

- For the ipmipower backend:
  - `ipmi-oem` command line tool (freeipmi) or other configurable command

### Installation

You can install, upgrade, or uninstall colmet with these commands:

```
$ pip install [--user] colmet
$ pip install [--user] --upgrade colmet
$ pip uninstall colmet
```

Or from git (latest development version):

```
$ pip install [--user] git+https://github.com/oar-team/colmet.git
```

Or, if you have already pulled the sources:

```
$ pip install [--user] path/to/sources
```

### Usage:

For the nodes:

```
sudo colmet-node -vvv --zeromq-uri tcp://127.0.0.1:5556
```

For the collector:

```
# Simple collection into a local HDF5 file:
colmet-collector -vvv --zeromq-bind-uri tcp://127.0.0.1:5556 --hdf5-filepath /data/colmet.hdf5 --hdf5-complevel 9
```

```
# Collector with an Elasticsearch backend:
  colmet-collector -vvv \
    --zeromq-bind-uri tcp://192.168.0.1:5556 \
    --buffer-size 5000 \
    --sample-period 3 \
    --elastic-host http://192.168.0.2:9200 \
    --elastic-index-prefix colmet_dahu_ 2>>/var/log/colmet_err.log >> /var/log/colmet.log
```

You will see the number of counters retrieved in the debug log.


For more information, please refer to the help of these scripts (`--help`).

### Notes about backends

Some input backends need external libraries that must be compiled and installed beforehand:

```
# For the perfhw backend:
cd colmet/node/backends/lib_perf_hw/ && make && cp lib_perf_hw.so /usr/local/lib/
# For the rapl backend:
cd colmet/node/backends/lib_rapl/ && make && cp lib_rapl.so /usr/local/lib/
```

Here's a complete colmet-node start-up process, with the perfhw, rapl and other backends:

```
export LIB_PERFHW_PATH=/usr/local/lib/lib_perf_hw.so
export LIB_RAPL_PATH=/applis/site/colmet/lib_rapl.so

colmet-node -vvv --zeromq-uri tcp://192.168.0.1:5556 \
   --cpuset_rootpath /dev/cpuset/oar \
   --enable-infiniband --omnipath \
   --enable-lustre \
   --enable-perfhw --perfhw-list instructions cache_misses page_faults cpu_cycles cache_references \
   --enable-RAPL \
   --enable-jobproc \
   --enable-ipmipower >> /var/log/colmet.log 2>&1
```
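
Before launching `colmet-node` with these backends, it can help to confirm that the two library paths are actually set and point to existing files. The following is a minimal sketch of such a pre-flight check; the helper name `missing_libraries` is hypothetical and not part of colmet, and only the two environment variable names are taken from the example above.

```python
import os

# Environment variables used in the start-up example above.
LIB_ENV_VARS = ("LIB_PERFHW_PATH", "LIB_RAPL_PATH")


def missing_libraries(environ=None):
    """Return a list of human-readable problems (empty if all is fine).

    Hypothetical helper: colmet itself does not ship this function.
    """
    environ = os.environ if environ is None else environ
    problems = []
    for var in LIB_ENV_VARS:
        path = environ.get(var)
        if not path:
            problems.append("{} is not set".format(var))
        elif not os.path.isfile(path):
            problems.append("{} points to a missing file: {}".format(var, path))
    return problems


# Report any problem found in the current environment.
for problem in missing_libraries():
    print("warning:", problem)
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching the real environment.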

#### RAPL - Running Average Power Limit (Intel)

RAPL is a feature of recent Intel processors that makes it possible to read the power consumption of the CPU in real time.

Usage: start colmet-node with option `--enable-RAPL`

A file named RAPL_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between `counter_1`, `counter_2`, etc. in the collected data and the actual name of the metric, as well as the package and zone (core / uncore / dram) of the processor the metric refers to.

If a given counter is not supported by the hardware, the metric name will be "`counter_not_supported_by_hardware`" and `0` values will appear in the collected data; `-1` values in the collected data mean that no counter is mapped to the column.
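
As an illustration of how these conventions might be applied when post-processing collected data, here is a minimal sketch. It assumes the mapping file has already been loaded into a dictionary from `counter_N` to metric name; that in-memory form, the function name `label_rapl_sample` and the metric name `energy_package_0` are all hypothetical, since the exact CSV layout is not described here.

```python
# Name reported for counters the hardware does not support.
UNSUPPORTED = "counter_not_supported_by_hardware"


def label_rapl_sample(mapping, sample):
    """Return {metric_name: value} for the counters that carry real data.

    mapping: {"counter_1": "<metric name>", ...}  (assumed in-memory form)
    sample:  {"counter_1": value, ...}            (one collected row)
    """
    labeled = {}
    for counter, value in sample.items():
        name = mapping.get(counter)
        if value == -1 or name is None:
            continue  # no counter is mapped to this column
        if name == UNSUPPORTED:
            continue  # the 0 is a placeholder, not a measurement
        labeled[name] = value
    return labeled


# Hypothetical mapping and sample, for illustration only.
mapping = {"counter_1": "energy_package_0", "counter_2": UNSUPPORTED}
sample = {"counter_1": 1234, "counter_2": 0, "counter_3": -1}
print(label_rapl_sample(mapping, sample))
```

Dropping the placeholder columns early keeps `0` and `-1` from being mistaken for real measurements in later aggregation.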

#### Perfhw

This backend provides metrics collected using the [perf_event_open](http://man7.org/linux/man-pages/man2/perf_event_open.2.html) interface.

Usage: start colmet-node with option `--enable-perfhw`

Optionally, choose the metrics you want (5 metrics at most) using the option `--perfhw-list` followed by a space-separated list of metrics.

Example : `--enable-perfhw --perfhw-list instructions cpu_cycles cache_misses`

A file named perfhw_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between `counter_1`, `counter_2`, etc from collected data and the actual name of the metric.

Available metrics (refer to the perf_event_open documentation for their meaning):

```
cpu_cycles 
instructions 
cache_references 
cache_misses 
branch_instructions
branch_misses
bus_cycles 
ref_cpu_cycles 
cache_l1d 
cache_ll
cache_dtlb 
cache_itlb 
cache_bpu 
cache_node 
cache_op_read 
cache_op_prefetch 
cache_result_access 
cpu_clock 
task_clock 
page_faults 
context_switches 
cpu_migrations
page_faults_min
page_faults_maj
alignment_faults 
emulation_faults
dummy
bpf_output
```
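
The two constraints above (names taken from the list, five metrics at most) can be checked before building the command line. This is a minimal sketch, not part of colmet: the helper `perfhw_args` is hypothetical, and only the option names and metric names come from this document.

```python
# Metric names copied from the list above.
AVAILABLE_METRICS = {
    "cpu_cycles", "instructions", "cache_references", "cache_misses",
    "branch_instructions", "branch_misses", "bus_cycles", "ref_cpu_cycles",
    "cache_l1d", "cache_ll", "cache_dtlb", "cache_itlb", "cache_bpu",
    "cache_node", "cache_op_read", "cache_op_prefetch", "cache_result_access",
    "cpu_clock", "task_clock", "page_faults", "context_switches",
    "cpu_migrations", "page_faults_min", "page_faults_maj",
    "alignment_faults", "emulation_faults", "dummy", "bpf_output",
}
MAX_METRICS = 5


def perfhw_args(metrics=()):
    """Build the perfhw part of a colmet-node command line (hypothetical helper)."""
    metrics = list(metrics)
    if len(metrics) > MAX_METRICS:
        raise ValueError("at most {} metrics can be selected".format(MAX_METRICS))
    unknown = [m for m in metrics if m not in AVAILABLE_METRICS]
    if unknown:
        raise ValueError("unknown metrics: {}".format(", ".join(unknown)))
    args = ["--enable-perfhw"]
    if metrics:
        args += ["--perfhw-list"] + metrics
    return args


print(" ".join(perfhw_args(["instructions", "cpu_cycles", "cache_misses"])))
```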

#### Temperature

This backend gets temperatures from `/sys/class/thermal/thermal_zone*/temp`

Usage: start colmet-node with option `--enable-temperature`

A file named temperature_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between `counter_1`, `counter_2`, etc from collected data and the actual name of the metric.
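
To see what these sysfs files contain on a given node, the following minimal sketch reads them directly. It is not colmet's implementation; the function name `read_thermal_zones` is hypothetical, and it relies on the Linux sysfs convention that `temp` is reported in millidegrees Celsius.

```python
import glob
import os


def read_thermal_zones(root="/sys/class/thermal"):
    """Return {zone_name: temperature in degrees Celsius} for readable zones."""
    temps = {}
    pattern = os.path.join(root, "thermal_zone*", "temp")
    for path in sorted(glob.glob(pattern)):
        zone = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as f:
                # sysfs reports millidegrees Celsius
                temps[zone] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
    return temps


for zone, celsius in read_thermal_zones().items():
    print(zone, celsius)
```

On a machine without thermal zones the function simply returns an empty dictionary.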



Colmet CHANGELOG
================

Version 0.6.8
-------------
- Added nvidia GPU support

Version 0.6.7
-------------
- Bugfix: missing glob import in procstats

Version 0.6.6
-------------
- Added --no-check-certificates option for elastic backend
- Added involved jobs and new metrics to jobprocstats

Version 0.6.4
-------------

- Added http auth support for elasticsearch backend


Version 0.6.3
-------------

Released on September 4th 2020

- Bugfixes in the lustrestats and jobprocstats backends

Version 0.6.2
-------------

Released on September 3rd 2020

- Python package fix

Version 0.6.1
-------------

Released on September 3rd 2020

- New input backends: lustre, infiniband, temperature, rapl, perfhw, ipmipower, jobproc
- New output backend: elasticsearch
- Example Grafana dashboard for Elasticsearch backend
- Added "involved_jobs" value for metrics that are global to a node (job 0)
- Bugfix for "dictionary changed size during iteration"

Version 0.5.4
-------------

Released on January 19th 2018

- hdf5 extractor script for OAR RESTFUL API
- Added infiniband backend
- Added lustre backend
- Fixed cpuset_rootpath default always appended

Version 0.5.3
-------------

Released on April 29th 2015

- Removed unnecessary lock from the collector to prevent colmet from waiting forever
- Removed (async) zmq eventloop and added ``--sample-period`` to the collector.
- Fixed some bugs about hdf file

Version 0.5.2
-------------

Released on Apr 2nd 2015

- Fixed python syntax error


Version 0.5.1
-------------

Released on Apr 2nd 2015

- Fixed error about missing ``requirements.txt`` file in the sdist package


Version 0.5.0
-------------

Released on Apr 2nd 2015

- Don't run colmet as a daemon anymore
- Maintained compatibility with zmq 3.x/4.x
 - Dropped ``--zeromq-swap`` (swap was dropped from zmq 3.x)
 - Handled zmq name change from HWM to SNDHWM and RCVHWM
- Fixed requirements
- Dropped python 2.6 support

Version 0.4.0
-------------

- Saved metrics in new HDF5 file if colmet is reloaded in order to avoid HDF5 data corruption
- Handled HUP signal to reload ``colmet-collector``
- Removed ``hiwater_rss`` and ``hiwater_vm`` collected metrics.


Version 0.3.1
-------------

- New metrics ``hiwater_rss`` and ``hiwater_vm`` for taskstats
- Worked with pyinotify 0.8
- Added ``--disable-procstats`` option to disable procstats backend.


Version 0.3.0
-------------

- Divided colmet package into three parts

  - colmet-node : Retrieve data from taskstats and procstats and send to
    collectors with ZeroMQ
  - colmet-collector : A collector that stores data received by ZeroMQ in a
    hdf5 file
  - colmet-common : Common colmet part.
- Added some parameters of ZeroMQ backend to prevent a memory overflow
- Simplified the command line interface
- Dropped rrd backend because it is not yet working
- Added ``--buffer-size`` option for collector to define the maximum number of
  counters that colmet should queue in memory before pushing it to output
  backend
- Handled SIGTERM and SIGINT to terminate colmet properly

Version 0.2.0
-------------

- Added options to enable hdf5 compression
- Support for multiple jobs by cgroup path scanning
- Used Inotify events for job list update
- Don't filter packets if no job_id range was specified, especially with zeromq
  backend
- Waited for the cgroup_path folder to be created before scanning the list of jobs
- Added procstat for node monitoring through a fictitious job with 0 as identifier
- Used absolute time to take measurements rather than the delay between
  measurements, to avoid drift in measurement time
- Added workaround for when a new cgroup is created without any process in it
  (monitoring is suspended until a process is launched)


Version 0.0.1
-------------

- Conception
