Commit 7a5e561: high cpu utilization on elastic clusters content
1 parent e5997f9

4 files changed: 340 additions & 0 deletions

File tree: articles/postgresql/TOC.yml (3 additions & 0 deletions)

```diff
@@ -673,6 +673,9 @@
 - name: Troubleshoot high CPU utilization
   href: troubleshoot/how-to-high-cpu-utilization.md
   displayName: High CPU Utilization
+- name: Troubleshoot high CPU utilization on elastic clusters
+  href: troubleshoot/how-to-high-cpu-utilization-elastic-clusters.md
+  displayName: High CPU Utilization on Elastic Clusters
 - name: Troubleshoot high memory utilization
   href: troubleshoot/how-to-high-memory-utilization.md
   displayName: High Memory Utilization
```
New file: articles/postgresql/troubleshoot/how-to-high-cpu-utilization-elastic-clusters.md (337 additions & 0 deletions)
---
title: High CPU Utilization Across Elastic Clusters
description: Troubleshoot high CPU utilization across Azure PostgreSQL elastic clusters.
author: gapaderla
ms.author: gapaderla
ms.reviewer: jaredmeade
ms.date: 01/28/2026
ms.service: azure-database-postgresql
ms.subservice: flexible-server
ms.topic: troubleshooting
---

# Troubleshoot High CPU Utilization in Azure Database for PostgreSQL Elastic Clusters

This article describes how to identify the root cause of high CPU utilization. It also provides possible remedial actions to control CPU utilization when using [Elastic clusters in Azure Database for PostgreSQL](concepts-elastic-clusters.md).

In this article, you learn how to:

- Use tools like Azure Metrics, `pg_stat_statements`, `citus_stat_activity`, and `pg_stat_activity` to identify high CPU utilization.
- Identify root causes, such as long-running queries and a high total number of connections.
- Resolve high CPU utilization by using `EXPLAIN ANALYZE` and vacuuming tables.

## Tools to Identify High CPU Utilization

Consider using the following tools to identify high CPU utilization:

### Azure Metrics

Azure Metrics is a good starting point to check CPU utilization for a specific period. Metrics provide information about the resources utilized during the period you're monitoring. You can use the **Apply splitting** option and **Split by Server Name** to view the details of each individual node in your elastic cluster. You can then compare **Write IOPs, Read IOPs, Read Throughput Bytes/Sec**, and **Write Throughput Bytes/Sec** with **CPU percent** to view the performance of individual nodes when your workload consumes high CPU.

Once you identify a particular node (or nodes) with higher than expected CPU utilization, you can connect directly to the nodes in question and perform a more in-depth analysis using the following Postgres tools:

### pg_stat_statements

The `pg_stat_statements` extension helps identify queries that consume time on the server. For more information about this extension, see the detailed [documentation](https://www.postgresql.org/docs/current/pgstatstatements.html).

#### Calls/Mean & Total Execution Time

The following query returns the top five SQL statements by highest total execution time:

```sql
SELECT userid::regrole, dbid, query, total_exec_time, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
```

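When you investigate a specific incident, it can help to first reset the extension's accumulated counters so that subsequent readings reflect only the current workload window (requires appropriate privileges):

```sql
-- Discard all previously accumulated statistics, then re-run the
-- top-five query after the workload has run for a while.
SELECT pg_stat_statements_reset();
```
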
### pg_stat_activity

The `pg_stat_activity` view shows the queries that are currently executing on a specific node. Monitor active queries, sessions, and states on that node.

```sql
SELECT *, now() - xact_start AS duration
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'active') AND pid <> pg_backend_pid()
ORDER BY duration DESC;
```

### citus_stat_activity

The `citus_stat_activity` view shows the distributed queries that are executing on all nodes, and is a superset of `pg_stat_activity`. This view also shows tasks specific to subqueries dispatched to workers, task state, and worker nodes.

```sql
SELECT *, now() - xact_start AS duration
FROM citus_stat_activity
WHERE state IN ('idle in transaction', 'active') AND pid <> pg_backend_pid()
ORDER BY duration DESC;
```

## Identify Root Causes

If CPU consumption levels are high in general, the following scenarios could be possible root causes:

### Long-running transactions on a specific node

Long-running transactions can consume CPU resources, leading to high CPU utilization.

The following query provides information on long-running transactions:

```sql
SELECT
    pid,
    datname,
    usename,
    application_name,
    client_addr,
    backend_start,
    query_start,
    now() - query_start AS duration,
    state,
    wait_event,
    wait_event_type,
    query
FROM pg_stat_activity
WHERE pid <> pg_backend_pid() AND state IN ('idle in transaction', 'active')
ORDER BY now() - query_start DESC;
```

### Long-running transactions on all nodes

Long-running transactions can consume CPU resources, leading to high CPU utilization.

The following query provides information on long-running transactions across all nodes:

```sql
SELECT
    global_pid,
    pid,
    nodeid,
    datname,
    usename,
    application_name,
    client_addr,
    backend_start,
    query_start,
    now() - query_start AS duration,
    state,
    wait_event,
    wait_event_type,
    query
FROM citus_stat_activity
WHERE pid <> pg_backend_pid() AND state IN ('idle in transaction', 'active')
ORDER BY now() - query_start DESC;
```

### Slow queries

Slow queries can consume CPU resources, leading to high CPU utilization.

The following query helps identify queries with longer run times. Replace `<queryid>` with a specific query ID to look up a single statement:

```sql
SELECT
    query,
    calls,
    mean_exec_time,
    total_exec_time,
    rows,
    shared_blks_hit,
    shared_blks_read,
    shared_blks_dirtied,
    shared_blks_written,
    temp_blks_read,
    temp_blks_written,
    wal_records,
    wal_fpi,
    wal_bytes
FROM pg_stat_statements
WHERE query ILIKE '%select%' OR query ILIKE '%insert%' OR query ILIKE '%update%' OR query ILIKE '%delete%' OR queryid = <queryid>
ORDER BY total_exec_time DESC;
```

### Total number of connections and number of connections by state on a node

Many connections to the database might also lead to increased CPU utilization.

The following query provides information about the number of connections by state on a single node:

```sql
SELECT state, COUNT(*)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
GROUP BY state
ORDER BY state ASC;
```

### Total number of connections and number of connections by state on all nodes

Many connections to the database might also lead to increased CPU utilization.

The following query provides information about the number of connections by state across all nodes:

```sql
SELECT state, COUNT(*)
FROM citus_stat_activity
WHERE pid <> pg_backend_pid()
GROUP BY state
ORDER BY state ASC;
```

### Vacuum and Table Stats

Keeping table statistics up to date helps improve query performance. Monitor whether regular autovacuuming is being carried out.

The following query helps identify the tables that need vacuuming. The `relname LIKE '%orders%'` filter is an example; adjust it to match your own tables. The dead-tuple ratio is cast to `float` to avoid integer division truncating the result to zero:

```sql
SELECT *
FROM run_command_on_workers($$
    SELECT json_agg(t)
    FROM (
        SELECT schemaname, relname,
            n_live_tup, n_dead_tup,
            n_dead_tup::float / n_live_tup AS bloat,
            last_autovacuum, last_autoanalyze,
            last_vacuum, last_analyze
        FROM pg_stat_user_tables
        WHERE n_live_tup > 0 AND relname LIKE '%orders%'
        ORDER BY n_dead_tup DESC
    ) t
$$);
```

The following image shows the output from the preceding query. The `result` column is a json value containing the stats:

:::image type="content" source="./media/how-to-high-cpu-utilization-elastic-clusters/elastic-clusters-cpu-utilization-result.png" alt-text="Results returned from the query, including the result column as a json datatype." lightbox="./media/how-to-high-cpu-utilization-elastic-clusters/elastic-clusters-cpu-utilization-result.png":::

The `last_autovacuum` and `last_autoanalyze` columns provide the date and time when the table was last autovacuumed or analyzed. If the tables aren't being vacuumed regularly, take steps to tune autovacuum.

The following query provides information regarding the amount of bloat at the schema level:

```sql
SELECT *
FROM run_command_on_workers($$
    SELECT json_agg(t)
    FROM (
        SELECT schemaname,
            sum(n_live_tup) AS live_tuples,
            sum(n_dead_tup) AS dead_tuples
        FROM pg_stat_user_tables
        WHERE n_live_tup > 0
        GROUP BY schemaname
        ORDER BY sum(n_dead_tup) DESC
    ) t
$$);
```

## Resolve High CPU Utilization

Use `EXPLAIN ANALYZE` to examine slow queries, and terminate any improperly long-running transactions. Consider using the built-in PgBouncer connection pooler and clearing up excessive bloat to resolve high CPU utilization.

### Use EXPLAIN ANALYZE

Once you know which queries are consuming the most CPU, use **EXPLAIN ANALYZE** to further investigate and tune them.

For more information about the **EXPLAIN ANALYZE** command, review its [documentation](https://www.postgresql.org/docs/current/sql-explain.html).

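As a minimal, hypothetical sketch (the `orders` table and its filter are placeholders, not part of this article's schema):

```sql
-- ANALYZE executes the statement and reports actual row counts and timings
-- per plan node; BUFFERS adds shared-buffer hit/read counts.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM orders
WHERE status = 'pending';
```

Compare the estimated and actual row counts in the output; large mismatches often point to stale statistics, which vacuuming and analyzing can fix.
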
### Terminate long-running transactions on a single node

You can consider terminating a long-running transaction if it's running longer than expected.

To terminate a session, first find its PID by using the following query:

```sql
SELECT
    pid,
    datname,
    usename,
    application_name,
    client_addr,
    backend_start,
    query_start,
    now() - query_start AS duration,
    state,
    wait_event,
    wait_event_type,
    query
FROM pg_stat_activity
WHERE pid <> pg_backend_pid() AND state IN ('idle in transaction', 'active')
ORDER BY now() - query_start DESC;
```

You can also filter by other properties like `usename` (user name), `datname` (database name), and so on.

Once you have the session's PID, you can terminate it using the following query:

```sql
SELECT pg_terminate_backend(pid);
```

Terminating the PID ends the specific session on that node.

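If ending the whole session is too drastic, `pg_cancel_backend` is a gentler alternative that cancels only the session's current query and leaves the connection open. The PID below is a placeholder for a value found with the previous query:

```sql
-- Cancel the running query without closing the session.
SELECT pg_cancel_backend(12345);  -- replace 12345 with the actual PID
```
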
### Terminate long-running transactions on all nodes

You can also consider terminating a long-running transaction that spans multiple nodes.

To terminate a session, first find its `pid` and `global_pid` by using the following query:

```sql
SELECT
    global_pid,
    pid,
    nodeid,
    datname,
    usename,
    application_name,
    client_addr,
    backend_start,
    query_start,
    now() - query_start AS duration,
    state,
    wait_event,
    wait_event_type,
    query
FROM citus_stat_activity
WHERE pid <> pg_backend_pid() AND state IN ('idle in transaction', 'active')
ORDER BY now() - query_start DESC;
```

You can also filter by other properties like `usename` (user name), `datname` (database name), and so on.

Once you have the session's PID, you can terminate it using the following query:

```sql
SELECT pg_terminate_backend(pid);
```

Terminating the PID ends the specific session on a worker node.

The same query running on different worker nodes might share the same `global_pid`. In that case, you can end the long-running transaction on all worker nodes by using the `global_pid`:

```sql
SELECT pg_terminate_backend(global_pid);
```

The following screenshot shows how `global_pid` values map to session `pid` values.

:::image type="content" source="./media/how-to-high-cpu-utilization-elastic-clusters/global-pid-to-session-pid-example.png" alt-text="Global pid to session pid reference example." lightbox="./media/how-to-high-cpu-utilization-elastic-clusters/global-pid-to-session-pid-example.png":::

> [!NOTE]
> To limit long-running transactions automatically, consider setting the server parameters `statement_timeout` or `idle_in_transaction_session_timeout`.

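As a sketch, these parameters can also be set per database or per role; the object names and values below are illustrative, not recommendations:

```sql
-- Cancel any statement in the hypothetical mydb database
-- that runs longer than 5 minutes.
ALTER DATABASE mydb SET statement_timeout = '5min';

-- End sessions of the hypothetical app_user role that sit idle
-- inside an open transaction for more than 10 minutes.
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '10min';
```
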
## Clearing Bloat

A short-term solution is to manually vacuum and then analyze the tables where slow queries are seen:

```sql
VACUUM ANALYZE <table>;
```

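For a longer-term fix, autovacuum can be made more aggressive on heavily updated tables via per-table storage parameters. The table name and threshold below are illustrative:

```sql
-- Trigger autovacuum on the hypothetical orders table once ~2% of its
-- rows are dead, instead of the default 20% scale factor.
ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.02);
```
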
## Managing Connections

In situations where there are many short-lived connections, or many connections that remain idle for most of their life, consider using a connection pooler like PgBouncer.

### PgBouncer, a built-in connection pooler

For more information about PgBouncer, see [connection pooling](https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/not-all-postgres-connection-pooling-is-equal/ba-p/825717) and [connection handling best practices with PostgreSQL](https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/connection-handling-best-practice-with-postgresql/ba-p/790883).

Azure Database for PostgreSQL elastic clusters offer PgBouncer as a built-in connection pooling solution. For more information, see [PgBouncer](concepts-pgbouncer.md).

## Related content

- [Server parameters in Azure Database for PostgreSQL](concepts-server-parameters.md)
- [Autovacuum tuning in Azure Database for PostgreSQL](how-to-autovacuum-tuning.md)