Tracing
job scheduling
The following trace extract is an abbreviated
trace for the execution of a single job:
|
1
Wed Dec 29 15:54:00 2004
[AGENTEXEC 1160] do_exec: execquery,s0,F,SELECT ALL r_object_id,
a_next_invocation FROM dm_job WHERE ( (run_now = 1) OR
((is_inactive = 0) AND ( ( a_next_invocation <= DATE('now')
AND a_next_invocation IS NOT NULLDATE ) OR ( a_next_continuation
<= DATE('now') AND a_next_continuation IS NOT NULLDATE
) ) AND ((expiration_date > DATE('now')) OR (expiration_date
IS NULLDATE)) AND ((max_iterations = 0) OR (a_iterations
< max_iterations))) ) AND (i_is_reference = 0 OR i_is_reference
is NULL) AND (i_is_replica = 0 OR i_is_replica is NULL)
ORDER BY a_next_invocation, r_object_id
2 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get: getlastcoll,s0
3 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_next: next,s0,q0
4 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get: get,s0,q0,r_object_id
5 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get_attr (result):
08053cc280000208
6 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_exec: fetch,s0,08053cc280000208,dm_job
7 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get: get,s0,08053cc280000208,object_name
8 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get_attr (result):
dm_FulltextMgr
9 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] Processing Job:
dm_FulltextMgr
<retrieve attributes method_name,max_iterations,run_interval,run_mode,etc.>
10 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get: get,s0,08053cc280000208,target_server
11
Wed Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get_attr
(result): dm2.dm2@xenselap1
12 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get: get,s0,serverconfig,object_name
13 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get_attr (result):
dm2
14 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get: get,s0,serverconfig,r_host_name
15 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get_attr (result):
xenselap1
16 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] check_target_server:
dm2.dm2@xenselap1
17 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] current_docbase:
dm2
18 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] current_config:
dm2
19 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] current_machine:
xenselap1
20 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get: get,s0,08053cc280000208,r_lock_owner
21 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get_attr (result):
22 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get: checkout,s0,08053cc280000208,,,agentexec
<calculate job output file name>
23 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] Output File Name:
G:\Documentum\dba\log\00053cc2\agentexec\job_08053cc280000208
24 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] File is not locked.
<a_next_invocation calculation>
25 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_set: set,s0,08053cc280000208,a_next_invocation
26 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_set (value):
Wed Dec 29 15:55:00 2004
27 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_set: set,s0,08053cc280000208,a_next_continuation
28 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_set (value):
nulldate
29 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_exec: save,s0,08053cc280000208,T
30 Wed
Dec 29 15:54:01 2004 [AGENTEXEC 1160] start_method_launcher
31 Wed
Dec 29 15:54:01 2004 [LAUNCHER 808] set_trace_header
32 Wed
Dec 29 15:54:01 2004 [LAUNCHER 808] do_launcher_main
33 Wed
Dec 29 15:54:01 2004 [LAUNCHER 808] do_get: connect,dm2.dm2,dmadmin,'',,,try_native_first
34 Wed
Dec 29 15:54:01 2004 [LAUNCHER 808] do_get: get,s0,serverconfig,object_name
35 Wed
Dec 29 15:54:01 2004 [LAUNCHER 808] do_get_attr (result):
dm2
36 Wed
Dec 29 15:54:01 2004 [LAUNCHER 808] add_msg_param: dm2
37 Wed
Dec 29 15:54:01 2004 [LAUNCHER 808] add_msg_param: Wed
Dec 29 15:54:01 2004 [INFORMATION] [LAUNCHER 808] Detected
while preparing job dm_FulltextMgr for execution: Agent
Exec connected to server dm2:
38 Wed
Dec 29 15:54:01 2004 [AGENTEXEC 1160] ARG[0] = dm_agent_exec.exe
39 Wed
Dec 29 15:54:01 2004 [AGENTEXEC 1160] CreateProcess dm_agent_exec.exe
40 Wed
Dec 29 15:54:01 2004 [AGENTEXEC 1160] ARG[1] = -docbase_name
dm2.dm2
41 Wed
Dec 29 15:54:01 2004 [AGENTEXEC 1160] ARG[2] = -docbase_owner
dmadmin
42 Wed
Dec 29 15:54:01 2004 [AGENTEXEC 1160] ARG[3] = -trace_level
1
43 Wed
Dec 29 15:54:01 2004 [AGENTEXEC 1160] ARG[4] = -job_id
08053cc280000208
44 Wed
Dec 29 15:54:01 2004 [AGENTEXEC 1160] ARG[5] = -parent_handle
548
45 Wed
Dec 29 15:54:01 2004 [AGENTEXEC 1160] do_sleep: 30
|
|
|
Lines 1 to 3 are the job scheduler DQL query again but this
time round we have at least one job to run since lines 4 and
5 show that we have retrieved the r_object_id of a runnable
job, in fact it is the dm_FulltextMgr job as shown in lines
7,8 and 9.
I have omitted a whole section
of trace showing the attributes on the dm_job object being retrieved
and validated, however I have chosen to retain lines 10 to 19.
These lines show the checks that the job scheduler makes to
ensure that the target_server attribute on the retrieved job
object is compatible with the server config and host machine
for the current connection. This is probably the single most
common cause for jobs not running particularly in a multi-Content
Server environment where there will be multiple server configs/host
machines. It is also a common problem when restoring a docbase
from backup to a different machine - everything will work fine
except jobs 'mysteriously' fail to run. Every dm_job object
will have a target_server attribute that is incompatible with
the current host machine context. Simply run some DQL after
the restore to update the target_server attribute to its correct
value.
Lines 30 to 44 show the creation of the
process to run the job. During process creation dm_agent_exec
forks a new dm_agent_exec process to handle the execution
of the job's method. This happens even on Windows systems
where the more usual software design is to use an additional
thread in the same process. The new process connects to the
docbase and then executes the method. Line 30 is the starting
point for the process creation in the parent agentexec, lines
31 to 37 are the trace output from the child agentexec (notice
the different text within the square brackets, the number
is the OS ID for the process). When the method has been invoked
the child agentexec remains in existence until the method
completes and then it too exits, the parent agentexec sleeps
for 30 seconds (line 45) and then continues processing the
next job from the job scheduler query.
The next extract is a highly edited
trace from a system where 4 jobs have been scheduled to run
at the same time:
|
1 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_exec: execquery,s0,F,SELECT
ALL r_object_id, a_next_invocation FROM dm_job WHERE
( (run_now = 1) OR ((is_inactive = 0) AND ( ( a_next_invocation
<= DATE('now') AND a_next_invocation IS NOT NULLDATE
) OR ( a_next_continuation <= DATE('now') AND a_next_continuation
IS NOT NULLDATE ) ) AND ((expiration_date > DATE('now'))
OR (expiration_date IS NULLDATE)) AND ((max_iterations
= 0) OR (a_iterations < max_iterations))) ) AND (i_is_reference
= 0 OR i_is_reference is NULL) AND (i_is_replica = 0
OR i_is_replica is NULL) ORDER BY a_next_invocation,
r_object_id
2 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get: getlastcoll,s0
3 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_next: next,s0,q0
4 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get: get,s0,q0,r_object_id}
5 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] do_get_attr (result):
08053cc280000208
6 Wed
Dec 29 15:54:00 2004 [AGENTEXEC 1160] Processing Job:
dm_FulltextMgr
7
Wed Dec 29 15:54:01 2004 [AGENTEXEC 1160] CreateProcess
dm_agent_exec.exe
8 Wed
Dec 29 15:54:01 2004 [AGENTEXEC 1160] do_sleep: 30
9 Wed
Dec 29 15:54:31 2004 [AGENTEXEC 1160] do_next: next,s0,q0
Wed Dec 29 15:54:31 2004 [AGENTEXEC 1160] do_get: get,s0,q0,r_object_id
10 Wed
Dec 29 15:54:31 2004 [AGENTEXEC 1160] do_get_attr (result):
08053cc28000022f
11 Wed
Dec 29 15:54:31 2004 [AGENTEXEC 1160] Processing Job:
dm_DataDictionaryPublisher
12 Wed
Dec 29 15:54:31 2004 [AGENTEXEC 1160] CreateProcess
dm_agent_exec.exe
13 Wed
Dec 29 15:54:31 2004 [AGENTEXEC 1160] do_sleep: 30
14 Wed
Dec 29 15:55:01 2004 [AGENTEXEC 1160] do_next: next,s0,q0
15 Wed
Dec 29 15:55:01 2004 [AGENTEXEC 1160] do_get: get,s0,q0,r_object_id
16 Wed
Dec 29 15:55:01 2004 [AGENTEXEC 1160] do_get_attr (result):
08053cc280000231
17 Wed
Dec 29 15:55:01 2004 [AGENTEXEC 1160] Processing Job:
dm_CleanFTIndex
18 Wed
Dec 29 15:55:01 2004 [AGENTEXEC 1160] CreateProcess
dm_agent_exec.exe
19 Wed Dec 29
15:55:01 2004 [AGENTEXEC 1160] do_sleep: 30
20 Wed
Dec 29 15:55:31 2004 [AGENTEXEC 1160] do_exec: close,s0,q0
21 Wed
Dec 29 15:55:31 2004 [AGENTEXEC 1160] do_sleep: 60
22 Wed
Dec 29 15:56:31 2004 [AGENTEXEC 1160] do_exec: execquery,s0,F,SELECT
ALL r_object_id, a_next_invocation FROM dm_job WHERE
( (run_now = 1) OR ((is_inactive = 0) AND ( ( a_next_invocation
<= DATE('now') AND a_next_invocation IS NOT NULLDATE
) OR ( a_next_continuation <= DATE('now') AND a_next_continuation
IS NOT NULLDATE ) ) AND ((expiration_date > DATE('now'))
OR (expiration_date IS NULLDATE)) AND ((max_iterations
= 0) OR (a_iterations < max_iterations))) ) AND (i_is_reference
= 0 OR i_is_reference is NULL) AND (i_is_replica = 0
OR i_is_replica is NULL) ORDER BY a_next_invocation,
r_object_id
23 Wed
Dec 29 15:56:31 2004 [AGENTEXEC 1160] do_get: getlastcoll,s0
24 Wed
Dec 29 15:56:31 2004 [AGENTEXEC 1160] do_next: next,s0,q0
25
Wed Dec 29 15:56:31 2004 [AGENTEXEC 1160] do_get: get,s0,q0,r_object_id
26 Wed
Dec 29 15:56:31 2004 [AGENTEXEC 1160] do_get_attr (result):
08053cc280000210
27 Wed
Dec 29 15:56:31 2004 [AGENTEXEC 1160] Processing Job:
dm_StateOfDocbase
28 Wed
Dec 29 15:56:31 2004 [AGENTEXEC 1160] CreateProcess
dm_agent_exec.exe
29 Wed
Dec 29 15:56:31 2004 [AGENTEXEC 1160] do_sleep: 30
30 Wed
Dec 29 15:57:01 2004 [AGENTEXEC 1160] do_next: next,s0,q0
31 Wed
Dec 29 15:57:01 2004 [AGENTEXEC 1160] do_exec: close,s0,q0
32 Wed
Dec 29 15:57:01 2004 [AGENTEXEC 1160] do_sleep: 60
|
|
|
Line 1 shows the execution of the job
scheduler query, lines 3 to 7 show the selection and execution
of a job (dm_FulltextMgr) followed by the 30 second sleep (line
8). Lines 9-13 (notice the timestamps have advanced by 30 seconds)
show the selection and execution of the second job (dm_DataDictionaryPublisher)
and a further 30 second sleep. The 3rd job (dm_CleanFTIndex)
is executed in lines 14-19. However when agentexec wakes up
from the 3rd 30 second sleep the query collection is closed
(line 20), even though we know there is a 4th job to run! What
is going on?
Recall from the Content Server Administrators
guide there is a parameter -max_concurrent_jobs that can be
set for agentexec. The default is 3, which means agentexec
runs the job scheduler query and executes up to 3 jobs before
going to sleep and then starting over. We can see this in
lines 21 and 22 - agentexec sleeps for 60 seconds then runs
the job scheduler query and executes the 4th job (dm_StateOfDocbase)
in lines 23 to 28.
Note that the sleep period between query
executions is 60 seconds by default and is controllable by
the -override_sleep_duration agentexec parameter. The sleep
period between processing of rows of the job scheduler query
is 30 seconds, this duration cannot be altered.
This leads to 2 observations. First, there
is a commonly held belief that the 'job scheduler polls every
5 minutes'. I have heard this during numerous site visits
and have no idea where it originates from. Clearly this is
not the case, if there are no other jobs executing the worse
case latency is 60 seconds (plus a few seconds for startup
time) when the job scheduler has just gone to sleep after
processing the job scheduler query. I guess it probably seems
like a long time when you are waiting for the job to start.
Secondly, even if you reduce the inter-query
sleep time using -override_sleep_duration parameter there
is a built in 30 second latency between job executions. Don't
count on application designs that involve using large numbers
of frequently executing jobs. next
|