Commits · master · tang liyu / ceph-ansible

Jul 09, 2021

update: fail the playbook if straw2 conversion failed · c396122a


It's better to fail the playbook so the user is aware the straw2
migration has failed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

c396122a

update: followup on pr #6689 · 4eb4268d

Guillaume Abrioux authored 3 years ago


Add missing 'osd' command.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

4eb4268d

update: convert straw bucket · eee57647

Guillaume Abrioux authored 3 years ago

After an upgrade, the presence of straw buckets will produce the
following warning (HEALTH_WARN):

```
crush map has legacy tunables (require firefly, min is hammer)
```

Because the straw bucket type is a firefly feature, it needs to be
converted to straw2.
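
As a rough illustration, detecting buckets that still need conversion could be sketched in Python as follows. The sketch assumes the JSON shape printed by `ceph osd crush dump` (a `buckets` list whose entries carry `name` and `alg` keys). The helper name `legacy_straw_buckets` and the sample data are hypothetical and not taken from ceph-ansible.

```python
def legacy_straw_buckets(crush_dump):
    """Return the names of buckets still using the legacy 'straw' algorithm."""
    return [
        bucket["name"]
        for bucket in crush_dump.get("buckets", [])
        if bucket.get("alg") == "straw"
    ]


# Hypothetical sample shaped like the output of `ceph osd crush dump`.
sample_dump = {
    "buckets": [
        {"id": -1, "name": "default", "alg": "straw"},
        {"id": -2, "name": "osd0", "alg": "straw2"},
    ]
}

legacy = legacy_straw_buckets(sample_dump)
```

A non-empty result would indicate that the cluster is a candidate for `ceph osd crush set-all-straw-buckets-to-straw2`.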

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967964



Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

eee57647

Jul 08, 2021

cephadm-adopt: set application on ganesha pool · aeb9f562

Dimitri Savineau authored 3 years ago

Set the nfs application to the ganesha pool.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1956840



Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

aeb9f562

Jul 07, 2021

dashboard: remove "certificate is valid for" error · 72a0336c

Guillaume Abrioux authored 3 years ago

When deploying the dashboard with ssl certificates generated by
ceph-ansible, we enforce the CN to 'ceph-dashboard', which can make
applications such as alertmanager complain as follows:

`err="Post https://mgr0:8443/api/prometheus_receiver: x509: certificate is valid for ceph-dashboard, not mgr0" context_err="context deadline exceeded"`

The idea here is to add alternative names matching all mgr/mon instances
in the certificate so this error won't appear in logs.
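
A minimal sketch of how such a list of alternative names might be assembled is shown below. It only builds an OpenSSL-style `subjectAltName` string. The function name `build_subject_alt_name` and the host names other than `mgr0` are assumptions for illustration, and the exact mechanism used by ceph-ansible is not shown in this commit message.

```python
def build_subject_alt_name(hostnames):
    """Build an OpenSSL-style subjectAltName value, dropping duplicates
    while keeping the original order."""
    unique = []
    for name in hostnames:
        if name not in unique:
            unique.append(name)
    return ",".join("DNS:" + name for name in unique)


# Hypothetical inventory: a host may be both a mgr and a mon.
mgr_hosts = ["mgr0", "mgr1"]
mon_hosts = ["mon0", "mgr0"]

san = build_subject_alt_name(["ceph-dashboard"] + mgr_hosts + mon_hosts)
```

With every mgr/mon name listed, a client connecting to `https://mgr0:8443` would find `mgr0` among the names the certificate is valid for.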

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1978869



Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

72a0336c

Jul 06, 2021

workflow: add dashboard playbook to ansible-lint · c5a2239e

Dimitri Savineau authored 3 years ago


The dashboard.yml playbook was missing from the ansible-lint workflow.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

c5a2239e

infra: add playbook to purge dashboard/monitoring · 8e4ef7d6

Dimitri Savineau authored 3 years ago

The dashboard/monitoring stack can be deployed via the dashboard_enabled
variable. But there's nothing similar if we want to remove that part only
and keep the ceph cluster up and running.
The current purge playbooks remove everything.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786691



Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

8e4ef7d6

Jul 05, 2021

dashboard: support dedicated network for the dashboard · f4f73b61

Guillaume Abrioux authored 3 years ago

This introduces a new variable `dashboard_network` in order to support
deploying the dashboard on a different subnet.
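
One way to picture what a dedicated network variable enables is selecting, per host, the address that falls inside the configured subnet. The sketch below uses only the standard `ipaddress` module. The helper `pick_address` and the sample addresses are hypothetical, and this is not necessarily how ceph-ansible resolves `dashboard_network`.

```python
import ipaddress


def pick_address(host_addresses, network_cidr):
    """Return the first host address that belongs to network_cidr, or None."""
    network = ipaddress.ip_network(network_cidr)
    for address in host_addresses:
        if ipaddress.ip_address(address) in network:
            return address
    return None


# Hypothetical host with one address on the public network and one on
# a separate dashboard subnet.
addresses = ["192.168.1.10", "10.0.0.5"]
dashboard_address = pick_address(addresses, "10.0.0.0/24")
```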

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1927574



Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

f4f73b61

ceph-crash: add install checkpoint · 993d06c4

Dimitri Savineau authored 3 years ago


The ceph crash install checkpoint callback was missing in the main
playbooks.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

993d06c4

Jul 02, 2021

cephadm_adopt: add any_errors_fatal on play · 3b804a61

Guillaume Abrioux authored 3 years ago

Add any_errors_fatal: true in cephadm-adopt playbook.
We should stop the playbook execution when a task throws an error.
Otherwise it can lead to unexpected behavior.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1976179



Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

3b804a61

purge: add monitoring group in final cleanup play · 037d8cd0

Guillaume Abrioux authored 3 years ago

This adds the monitoring group in the "final cleanup play" so any cid
files generated are properly removed when purging the cluster.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1974536



Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

037d8cd0

prometheus: fix prometheus target url · 1d568186

Dimitri Savineau authored 3 years ago

The prometheus service isn't binding on localhost.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1933560



Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

1d568186

ceph-facts: move device facts to its own file · d704b05e

Dimitri Savineau authored 4 years ago

Instead of repeating the condition 'inventory_hostname in groups[osds]'
on each device facts task, we can move all the tasks into a
dedicated file and set the condition on the import_tasks statement.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

d704b05e

ceph-validate: check logical volumes · 55bca07c

Dimitri Savineau authored 4 years ago

We currently don't check whether the logical volumes used in the lvm_volumes
list for either bluestore data/db/wal or filestore data/journal exist.
We only do this for raw devices in the batch scenario.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

55bca07c

ceph-validate: check db/journal/wal devices too · 808e7106

Dimitri Savineau authored 4 years ago


When using dedicated devices for the db/journal/wal objectstore components
with ceph-volume lvm batch, we should also validate that those devices
exist and don't use a gpt partition table, in addition to the devices
and lvm_volume.data variables.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

808e7106

ceph-validate: use root device from ansible_mounts · 7e50380f

Dimitri Savineau authored 4 years ago


Instead of using the findmnt command to find the device associated with the
root mount point, we can use the ansible_mounts fact.
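
The lookup can be sketched in Python, assuming the usual shape of the `ansible_mounts` fact (a list of dictionaries with `mount` and `device` keys). The helper name `root_device` and the sample data are hypothetical.

```python
def root_device(ansible_mounts):
    """Return the device mounted on '/', or None if it is not listed."""
    for mount in ansible_mounts:
        if mount.get("mount") == "/":
            return mount.get("device")
    return None


# Hypothetical excerpt of the ansible_mounts fact.
sample_mounts = [
    {"mount": "/boot", "device": "/dev/sda1", "fstype": "xfs"},
    {"mount": "/", "device": "/dev/sda2", "fstype": "xfs"},
]

device = root_device(sample_mounts)
```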

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

7e50380f

ceph-validate: do not resolve devices · 0df99dda

Dimitri Savineau authored 4 years ago


This is already done in the ceph-facts role.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

0df99dda

ceph-validate: check block presence first · 14d458b3

Dimitri Savineau authored 4 years ago


Instead of doing two parted calls, we can first check whether the device
exists and then test the partition table.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

14d458b3

ceph-validate: check devices from lvm_volumes · ac0342b7

Dimitri Savineau authored 4 years ago

2888c082 introduced a regression as the check_devices tasks file was
only included based on the devices variable.
But that file also validates some devices from the lvm_volumes variable.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1906022



Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

ac0342b7

Jun 30, 2021

container: set tcmalloc value by default · 9758e3c5

Dimitri Savineau authored 3 years ago

All ceph daemons need to have the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES
environment variable set to 128MB by default in container setup.
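
For reference, the arithmetic behind the default can be sketched as below, assuming "128MB" means 128 × 1024 × 1024 bytes. The helper `container_env` is a hypothetical illustration of passing the variable to a container, not ceph-ansible code.

```python
# Assumption: 128MB is interpreted as 128 * 1024 * 1024 bytes.
TCMALLOC_DEFAULT_BYTES = 128 * 1024 * 1024


def container_env(extra=None):
    """Return the environment for a daemon container, with the tcmalloc
    default set unless the caller overrides it."""
    env = {"TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES": str(TCMALLOC_DEFAULT_BYTES)}
    env.update(extra or {})
    return env


default_env = container_env()
```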

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1970913

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

9758e3c5

rhcs: remove ISO install method · a05730b3

Dimitri Savineau authored 3 years ago


Starting with RHCS 5, there's no ISO available anymore.
This removes all ISO variables and the ceph_repository_type variable.

Closes: #6626

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

a05730b3

library: flake8 ceph-ansible modules · beda1fe7

Wong Hoi Sing Edison authored 3 years ago


This commit ensures all ceph-ansible modules pass flake8 properly.

Signed-off-by: Wong Hoi Sing Edison <hswong3i@pantarei-design.com>

beda1fe7

Jun 29, 2021

workflows: test against 1 python version only · d191ba38

Guillaume Abrioux authored 3 years ago


Let's drop py3.6 and py3.7

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

d191ba38

workflows: add signed-off check · 8c094975

Guillaume Abrioux authored 3 years ago


This adds a github workflow for checking the signed off line in commit
messages.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

8c094975

workflow: add group_vars/defaults checks · d71db816

Guillaume Abrioux authored 3 years ago


Let's use a github workflow for checking default values.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

d71db816

workflow: add syntax check · 5ed423ad

Guillaume Abrioux authored 3 years ago


This adds the ansible --syntax-check test in the ansible-lint workflow.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

5ed423ad

tests: remove legacy file · 304d1cbb

Guillaume Abrioux authored 3 years ago


This inventory isn't used anywhere.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

304d1cbb

shrink-mgr: modify existing mgr check · 26a7256c

Guillaume Abrioux authored 3 years ago

Do not rely on the inventory aliases in order to check if the selected
manager to be removed is present.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967897



Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

26a7256c

cephadm-adopt/rgw: add host target in svc_id · 31311b03

Guillaume Abrioux authored 3 years ago

If multi-realms were deployed with several instances belonging to the same
realm and zone using the same port on different nodes, the service id
expected by cephadm will be the same and therefore only one service will
be deployed. We need to create a service called
`<node>.<realm>.<zone>.<port>` to be sure the service name will be unique
and properly deployed on the expected node in order to preserve backward
compatibility with the rgw instances that were deployed with
ceph-ansible.
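
The collision described here can be illustrated with a short Python sketch. It assumes the previous id was built from realm, zone and port only, which the message implies but does not spell out. The function name `rgw_service_id` and the sample instances are hypothetical.

```python
def rgw_service_id(node, realm, zone, port):
    """Build a service id of the form <node>.<realm>.<zone>.<port>."""
    return "{}.{}.{}.{}".format(node, realm, zone, port)


# Hypothetical instances: same realm, zone and port on two different nodes.
instances = [
    ("rgw0", "realm1", "zone1", 8080),
    ("rgw1", "realm1", "zone1", 8080),
]

# Assumed previous scheme (no node in the id): both instances collide.
ids_without_node = {"{}.{}.{}".format(r, z, p) for _, r, z, p in instances}

# Scheme described above: one unique id per node.
ids_with_node = {rgw_service_id(*instance) for instance in instances}
```

Here `ids_without_node` collapses to a single entry while `ids_with_node` keeps one entry per node.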

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967455



Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

31311b03

Jun 28, 2021

switch2container: run ceph-validate role · fc160b3b

Dimitri Savineau authored 3 years ago

This adds the ceph-validate role before starting the switch to a containerized
deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1968177



Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>

fc160b3b

Jun 24, 2021

library/ceph_key.py: rewrite for generate_ceph_cmd() · 793d5293

Wong Hoi Sing Edison authored 3 years ago


Also lint the code with flake8.

Signed-off-by: Wong Hoi Sing Edison <hswong3i@pantarei-design.com>

793d5293

dashboard: Add new prometheus alert · 2491d4e0

Boris Ranto authored 3 years ago

It was requested for us to update our alerting definitions to include a
slow OSD Ops health check.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1951664



Signed-off-by: Boris Ranto <branto@redhat.com>

2491d4e0

Jun 23, 2021

cephadm-adopt: support rgw multisite adoption · fc784fc4

Guillaume Abrioux authored 3 years ago

We need to support rgw multisite deployments.
This commit makes the adoption playbook support this kind of deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967455



Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

fc784fc4

Jun 16, 2021

multisite: fix bug during switch2containers · 8279d14d

Guillaume Abrioux authored 3 years ago

When running the switch-to-containers playbook with multisite enabled,
the fact "rgw_instances" is only set for the node being processed
(serial: 1). As a consequence, the set_fact of 'rgw_instances_all'
can't iterate over all rgw nodes in order to look up each
'rgw_instances_host'.

Adding a condition checking whether hostvars[item]["rgw_instances_host"]
is defined fixes this issue.
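
The effect of that condition can be sketched in Python with a plain dictionary standing in for `hostvars`. The helper `collect_rgw_instances` and the sample data are hypothetical, and the real fix is an Ansible condition rather than Python code.

```python
def collect_rgw_instances(rgw_hosts, hostvars):
    """Gather rgw_instances_host from every host that defines it,
    skipping hosts where the fact has not been set yet."""
    all_instances = []
    for host in rgw_hosts:
        instances = hostvars.get(host, {}).get("rgw_instances_host")
        if instances is None:
            # Fact not defined for this host (e.g. not processed yet).
            continue
        all_instances.extend(instances)
    return all_instances


# Hypothetical state with serial: 1, where only rgw0 has been processed.
sample_hostvars = {
    "rgw0": {"rgw_instances_host": [{"instance_name": "rgw0"}]},
    "rgw1": {},
}

rgw_instances_all = collect_rgw_instances(["rgw0", "rgw1"], sample_hostvars)
```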

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967926



Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

8279d14d

tests: Retry generating SSH vagrant config. Also add some debug. · 3eba2a15

David Galloway authored 3 years ago


Signed-off-by: David Galloway <dgallowa@redhat.com>

3eba2a15

nfs: do not copy client.bootstrap-rgw when using mds · 8dbee998

Guillaume Abrioux authored 3 years ago


There's no need to copy this keyring when using nfs with mds.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

8dbee998

Jun 15, 2021

container: conditionally disable lvmetad · 38bfad46

Guillaume Abrioux authored 3 years ago

Enabling lvmetad in containerized deployments on el7 based OS might
cause issues.
This commit makes it possible to disable this service if needed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1955040



Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

38bfad46

Jun 14, 2021

ceph_key: handle error in a better way · d58500ad

Guillaume Abrioux authored 3 years ago

When calling the `ceph_key` module with `state: info`, if the ceph
command called fails, the actual error is hidden by the module which
makes it pretty difficult to troubleshoot.

The current code always states that if rc is not equal to 0 the keyring
doesn't exist.

`state: info` should always return the actual rc, stdout and stderr.
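
The principle can be sketched in Python with the standard `subprocess` module: run the command and hand back rc, stdout and stderr untouched instead of collapsing every failure into "keyring doesn't exist". The function `run_info` is a hypothetical stand-in, not the `ceph_key` module's actual code, and the failing command below only simulates an error.

```python
import subprocess
import sys


def run_info(cmd):
    """Run cmd and return its real rc, stdout and stderr without masking errors."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {"rc": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}


# Simulated failing command (stands in for a failing ceph call).
failing_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('boom'); sys.exit(2)",
]

result = run_info(failing_cmd)
```

The caller can then distinguish a missing keyring from, say, a connection failure by inspecting `result["rc"]` and `result["stderr"]`.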

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1964889



Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

d58500ad

cephadm-adopt: fix mgr placement hosts task · f9a73149

Guillaume Abrioux authored 3 years ago

When no `[mgrs]` group is defined in the inventory, mgr daemons are
implicitly collocated with monitors.
This task currently relies on the length of the mgr group in order to
tell cephadm to deploy mgr daemons.
If there's no `[mgrs]` group defined in the inventory, it will ask
cephadm to deploy 0 mgr daemons, which doesn't make sense and will throw
an error.
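
The fallback implied by this description can be sketched in Python with a dictionary standing in for the inventory `groups`. The group names `mgrs` and `mons`, the helper `mgr_placement_hosts` and the sample inventory are assumptions for illustration, and the actual fix lives in the playbook.

```python
def mgr_placement_hosts(groups):
    """Return the hosts that should run mgr daemons: the mgrs group when it
    is defined and non-empty, otherwise the monitors (implicit collocation)."""
    mgrs = groups.get("mgrs") or []
    return mgrs if mgrs else groups.get("mons", [])


# Hypothetical inventory without a [mgrs] group.
inventory_groups = {"mons": ["mon0", "mon1", "mon2"]}

placement = mgr_placement_hosts(inventory_groups)
```

With this fallback the placement count is 3 rather than 0 for the sample inventory.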

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1970313



Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

f9a73149

tests: allocate more memory for all_in_one job · b49cdea7

Guillaume Abrioux authored 3 years ago


Since we fire up far fewer VMs than in other jobs, we can afford to allocate
more memory for this job.
Each VM hosts more daemons, so 1024 MB can be too little.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>

b49cdea7
