Fun with attribute STOP_TIMEOUT for a custom clusterware resource

alex-guillaume-769172-unsplash

(Photo by Alex Guillaume, on Unsplash)

I recently encountered a problem, for which I do not have any clue yet. But at least, I have a workaround. The goal of this blog post is to remember the exploration towards this workaround. And then to switch back to a normal sitution when possible.

For some reason, 120 development databases were configured to use Shared Server Architecture. The day after this change of configuration, a lot of users started complaining about a fully-automatized-0-problem-encountered-in-the-last-2-years-procedure to duplicate a production database to development database … Indeed, this procedure begins with stopping target database, and this day, failed almost everytime during this step … Why ?

What are the symptoms ?

This environment is an Oracle 12.1 Grid Infrastructure cold failover cluster with 2 nodes, thus all databases are resources created with a custom resource type called db_12c.type, belonging to the “cluster_resource” category. All of these databases are managed using an action script to control the way they are started, stopped and cleaned. In this specific case, the stop of these databases using clusterware failed :

CRS-2673: Attempting to stop 'tflora2t.db' on 'myserver1'
CRS-2675: Stop of 'tflora2t.db' on 'myserver1' failed

And all alert logs related to these databases contained the same info :

Waiting for dispatcher 'D000' to shutdown

In all cases, the whole shutdown process took more than 2 minutes, and most of those 2 minutes were spent waiting for this dispatcher to shutdown.
Since I am very new to the concept of Shared Server Architecture, I don’t know why it took so long (please contact me if you have any clue). But I needed to fix the situation to stop the anger of the users 🙂

A specific clusterware log file $ORACLE_BASE/diag/crs/myserver1/crs/trace/crsd_scriptagent_oracle_30708.trc contained an interesing information :

2019-02-11 23:16:15.828125 : AGFW:125810432: {2:1614:50126} Command: stop for resource: tflora2t.db 1 1 completed with status: TIMEDOUT 

Ok, there is something wrong with a timeout. I checked all attributes containing timeout for one of these databases :

# crsctl status res tflora2t.db -f | grep -i timeout

ACTION_TIMEOUT=60
CHECK_TIMEOUT=0
CLEAN_TIMEOUT=60
DELETE_TIMEOUT=60
INTERMEDIATE_TIMEOUT=0
MODIFY_TIMEOUT=60
SCRIPT_TIMEOUT=60
START_TIMEOUT=0
STOP_TIMEOUT=0

Which “TIMEOUT” attribute is the right one ?

I gathered valuable information in the documentation to figure out which attribute I would have to modify. To sum up, all the attributes CHECK_TIMEOUT, DELETE_TIMEOUT, MODIFY_TIMEOUT, START_TIMEOUT, STOP_TIMEOUT, if they are not specified, are equal to SCRIPT_TIMEOUT. And the description of SCRIPT_TIMEOUT is : The maximum time (in seconds) for an action to run. Oracle Clusterware returns an error message if the action script does not complete within the time specified. The timeout applies to all actions (start, stop, check, and clean).

I was about to increase SCRIPT_TIMEOUT, but STOP_TIMEOUT seemed more revelant, since the current issue happens when the database is stopped via clusterware. Changing SCRIPT_TIMEOUT would also have an impact on timeout related to start, stop, check, and clean actions, and it is not necessary.

What are the details related to this attribute in my custom type ?

# crsctl status type db_12c.type -f | grep -A 3 STOP_TIMEOUT
ATTRIBUTE=STOP_TIMEOUT
DEFAULT_VALUE=0
TYPE=INT
FLAGS=CONFIG HOTMOD

What do these flags mean ?

HOTMOD : If you change the value of an attribute for resources of this type, then the changes are applied immediately with the need to restart the resource.
This sentence sounds strange to me : is it really “with the need to restart the resource” ? Or should it be “without” instead ?
CONFIG: After you register a resource with this resource type, you can configure the attribute.
Good, this is what I want. The opposite flag would have been READONLY I guess.

Is “STOP_TIMEOUT” the attribute I am looking for ?

To make sure, I ran a quick test in which I deliberately set STOP_TIMEOUT to 5 on a resource, which is a way too low value :

# crsctl modify res tflora2t.db -attr "STOP_TIMEOUT=5"

Then I stopped this resource, and after 5 seconds, I got the following error :

# crsctl stop res tflora2t.db
CRS-2673: Attempting to stop 'tflora2t.db' on 'myserver1'
CRS-2675: Stop of 'tflora2t.db' on 'myserver1' failed
CRS-2679: Attempting to clean 'tflora2t.db' on 'myserver1'
CRS-2678: 'tflora2t.db' on 'myserver1' has experienced an unrecoverable failure
CRS-4000: Command Stop failed, or completed with errors.

How to apply the change on all resources ?

First I checked the initial situation for all resources :

# crsctl status res -w "TYPE co db_12" -f | grep STOP_TIMEOUT | sort | uniq
STOP_TIMEOUT=0

All of them have STOP_TIMEOUT=0.

At first, I thought changing STOP_TIMEOUT on a specific type would propagate to all resources of this type, so I gave it a try :

# crsctl modify type db_12c.type -attr "ATTRIBUTE=STOP_TIMEOUT,TYPE=INT,DEFAULT_VALUE=180"

… but the resources do not inherit this modification. I then modified the value of attribute STOP_TIMEOUT on all resources :

# for i in `crsctl status res -w "TYPE co db_12" | grep NAME | cut -d"=" -f2`; do crsctl modify res $i -attr "STOP_TIMEOUT=180"; done

At least, the value is now modified on the custom type db_12c.type, so future resources created with this type will inherit the correct configuration.

At the end, all the corresponding resources have the same STOP_TIMEOUT :

# crsctl status res -w "TYPE co db_12" -f | grep STOP_TIMEOUT | sort | uniq
STOP_TIMEOUT=180

Great !

Bonus – What could go wrong ?

A few years ago, I had a very very very bad surprise (and a bug number : 26261671). I created a new type with an attribute I was not supposed to use (CONFIG_VERSION). After the creation of the type, the output was OK (I did not get any error), so I considered everything was OK. But unfortunately, the OCR got silently corrupted, and when running ocrcheck with root, everything appeared to be fine too. The corruption was discovered because all OCR backups (automatic and manual) stopped working with the following output :

# sudo $ORACLE_HOME/bin/ocrconfig -manualbackup
PROT-23: Failed to backup OCR
PROC-60: The Oracle Cluster Registry backup file '$ORACLE_HOME/cdata/clu-name/backup_20170529_134254.ocr' is corrupt.

So in this case, after having modify my custom type, I checked the integrity of the OCR :

# sudo $ORACLE_HOME/bin/ocrcheck
Status of Oracle Cluster Registry is as follows :
[...]

Cluster registry integrity check succeeded

Logical corruption check succeeded

… and ran a manual backup, just to make sure everything was OK.

# sudo $ORACLE_HOME/bin/ocrconfig -manualbackup

myserver1 2019/02/12 16:46:13 $ORACLE_HOME/cdata/clu-name/backup_20190212_164613.ocr 1486476887

[...]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s