
Commit 399b203

Merge pull request #4311 from JaredMSFT/main
Adding High CPU Troubleshooting on Elastic Clusters Article
2 parents 43cf931 + dbdc1ec commit 399b203

5 files changed

+342
-6
lines changed

articles/postgresql/TOC.yml

Lines changed: 3 additions & 0 deletions
@@ -675,6 +675,9 @@
 - name: Troubleshoot high CPU utilization
   href: troubleshoot/how-to-high-cpu-utilization.md
   displayName: High CPU Utilization
+- name: Troubleshoot high CPU utilization on elastic clusters
+  href: troubleshoot/how-to-high-cpu-utilization-elastic-clusters.md
+  displayName: High CPU Utilization on Elastic Clusters
 - name: Troubleshoot high memory utilization
   href: troubleshoot/how-to-high-memory-utilization.md
   displayName: High Memory Utilization
Lines changed: 338 additions & 0 deletions
@@ -0,0 +1,338 @@
---
title: Troubleshoot High CPU Utilization in Elastic Clusters
description: How to troubleshoot high CPU utilization across Azure Database for PostgreSQL Elastic Clusters.
author: GayathriPaderla
ms.author: gapaderla
ms.reviewer: jaredmeade, maghan
ms.date: 02/17/2026
ms.service: azure-database-postgresql
ms.subservice: performance
ms.topic: troubleshooting-general
---

# Troubleshoot high CPU utilization in Azure Database for PostgreSQL Elastic Clusters

This article describes how to identify the root cause of high CPU utilization when using [Elastic clusters in Azure Database for PostgreSQL](../elastic-clusters/concepts-elastic-clusters.md). It also provides possible remedial actions to bring CPU utilization back under control.

In this article, you learn about:

- How to use tools like Azure Metrics, `pg_stat_statements`, `citus_stat_activity`, and `pg_stat_activity` to identify high CPU utilization.
- How to identify root causes, such as long-running queries and total connections.
- How to resolve high CPU utilization by using `EXPLAIN ANALYZE` and vacuuming tables.

## Tools to identify high CPU utilization

Use the following tools to identify high CPU utilization:

### Azure Metrics

Azure Metrics is a good starting point to check the CPU utilization for a specific period. Metrics provide information about the resources utilized during the period that you're monitoring. You can use the **Apply splitting** option and **Split by Server Name** to view the details of each individual node in your elastic cluster. When you observe your workload consuming high CPU, compare **Write IOPs**, **Read IOPs**, **Read Throughput Bytes/Sec**, and **Write Throughput Bytes/Sec** with **CPU percent** to view the performance of individual nodes.

After you identify a node (or nodes) with higher than expected CPU utilization, connect directly to the nodes in question and perform a more in-depth analysis by using the following Postgres tools:

### pg_stat_statements

The `pg_stat_statements` extension helps identify queries that consume time on the server. For more information about this extension, see the detailed [documentation](https://www.postgresql.org/docs/current/pgstatstatements.html).

#### Calls/Mean and total execution time

The following query returns the top five SQL statements by highest total execution time:

```sql
SELECT userid::regrole, dbid, query, total_exec_time, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
```
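
To judge how dominant each statement is, it can help to express its total execution time as a share of all recorded execution time. The following is a minimal sketch, assuming `pg_stat_statements` is enabled on the node you're connected to; the `pct_of_total` alias and the limit of five rows are illustrative choices, not part of the original guidance.

```sql
-- Sketch: share of total recorded execution time per statement.
-- total_exec_time is double precision, so cast to numeric before rounding.
SELECT
    queryid,
    calls,
    round(total_exec_time::numeric, 2) AS total_ms,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round(
        (100 * total_exec_time
            / nullif(sum(total_exec_time) OVER (), 0))::numeric,
        2
    ) AS pct_of_total  -- illustrative alias
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
```

A statement with a high `pct_of_total` but a low `mean_ms` is usually cheap per call and expensive only because it's called very often.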

### pg_stat_activity

The `pg_stat_activity` view shows the queries that are currently running on the specific node. Use it to monitor active queries, sessions, and states on that node.

```sql
SELECT *, now() - xact_start AS duration
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'active') AND pid <> pg_backend_pid()
ORDER BY duration DESC;
```

### citus_stat_activity

The `citus_stat_activity` view is a superset of `pg_stat_activity`. It shows the distributed queries that are running on all nodes. It also shows tasks specific to subqueries dispatched to workers, task state, and worker nodes.

```sql
SELECT *, now() - xact_start AS duration
FROM citus_stat_activity
WHERE state IN ('idle in transaction', 'active') AND pid <> pg_backend_pid()
ORDER BY duration DESC;
```

## Identify root causes

If CPU consumption levels are high, the following scenarios might be the root causes:

### Long-running transactions on a specific node

Long-running transactions can consume CPU resources and lead to high CPU utilization.

The following query provides information on long-running transactions on the node that you're connected to:

```sql
SELECT
    pid,
    datname,
    usename,
    application_name,
    client_addr,
    backend_start,
    query_start,
    now() - query_start AS duration,
    state,
    wait_event,
    wait_event_type,
    query
FROM pg_stat_activity
-- The state filter already excludes idle sessions.
WHERE pid <> pg_backend_pid() AND state IN ('idle in transaction', 'active')
ORDER BY now() - query_start DESC;
```
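
The preceding query measures the age of the current statement (`query_start`). To measure the age of the whole transaction instead, a variant can use `xact_start`. The following is a sketch under assumptions: the five-minute threshold and the 80-character query preview are arbitrary illustrative values.

```sql
-- Sketch: sessions whose transaction has been open longer than a threshold.
-- The 5-minute threshold is an assumed example value; tune it for your workload.
SELECT
    pid,
    usename,
    state,
    now() - xact_start   AS xact_duration,
    now() - state_change AS time_in_state,
    left(query, 80)      AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND pid <> pg_backend_pid()
  AND now() - xact_start > interval '5 minutes'
ORDER BY xact_start;
```

Sessions in the `idle in transaction` state with a large `time_in_state` hold their transaction open without doing work, which is often worth investigating first.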

### Long-running transactions on all nodes

Long-running transactions can consume CPU resources and lead to high CPU utilization.

The following query provides information on long-running transactions across all nodes:

```sql
SELECT
    global_pid,
    pid,
    nodeid,
    datname,
    usename,
    application_name,
    client_addr,
    backend_start,
    query_start,
    now() - query_start AS duration,
    state,
    wait_event,
    wait_event_type,
    query
FROM citus_stat_activity
-- The state filter already excludes idle sessions.
WHERE pid <> pg_backend_pid() AND state IN ('idle in transaction', 'active')
ORDER BY now() - query_start DESC;
```

### Slow query

Slow queries consume CPU resources and cause high CPU utilization.

The following query helps you identify queries that take a long time to run:

```sql
-- Replace <queryid> with a queryid value from pg_stat_statements,
-- or remove that condition to match on query text only.
SELECT
    query,
    calls,
    mean_exec_time,
    total_exec_time,
    rows,
    shared_blks_hit,
    shared_blks_read,
    shared_blks_dirtied,
    shared_blks_written,
    temp_blks_read,
    temp_blks_written,
    wal_records,
    wal_fpi,
    wal_bytes
FROM pg_stat_statements
WHERE query ILIKE '%select%' OR query ILIKE '%insert%' OR query ILIKE '%update%' OR query ILIKE '%delete%' OR queryid = <queryid>
ORDER BY total_exec_time DESC;
```
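
The block counters returned by the preceding query can be combined into a rough per-statement cache hit percentage, which helps separate CPU-bound statements from those waiting on reads. The following is a sketch only; the `calls >= 10` filter and the limit of ten rows are assumed values chosen to reduce noise.

```sql
-- Sketch: slowest statements by mean time, with a rough cache hit percentage.
-- nullif() avoids division by zero for statements that touched no shared blocks.
SELECT
    queryid,
    calls,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round(
        100.0 * shared_blks_hit
            / nullif(shared_blks_hit + shared_blks_read, 0),
        2
    ) AS cache_hit_pct,
    temp_blks_written
FROM pg_stat_statements
WHERE calls >= 10          -- assumed threshold to skip one-off statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```

A nonzero `temp_blks_written` suggests that sorts or hashes are spilling to disk for that statement.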

### Total number of connections and number of connections by state on a node

A large number of connections to the database can lead to increased CPU utilization.

The following query provides information about the number of connections by state on a single node:

```sql
SELECT state, COUNT(*)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
GROUP BY state
ORDER BY state ASC;
```
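
Connection counts are easier to interpret next to the configured limit. The following sketch compares client connections on the current node with `max_connections`; the `pct_used` alias is illustrative, and the `backend_type` filter assumes you only want to count client sessions rather than background processes.

```sql
-- Sketch: client connections on this node relative to max_connections.
SELECT
    count(*) AS client_connections,
    current_setting('max_connections')::int AS max_connections,
    round(
        100.0 * count(*) / current_setting('max_connections')::int,
        1
    ) AS pct_used  -- illustrative alias
FROM pg_stat_activity
WHERE backend_type = 'client backend';
```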

### Total number of connections and number of connections by state on all nodes

A large number of connections to the database can lead to increased CPU utilization.

The following query provides information about the number of connections by state across all nodes:

```sql
SELECT state, COUNT(*)
FROM citus_stat_activity
WHERE pid <> pg_backend_pid()
GROUP BY state
ORDER BY state ASC;
```
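
To see whether connections are concentrated on one node, the same count can be grouped by the `nodeid` column that `citus_stat_activity` exposes. This is a sketch, not a prescribed query; the `connections` alias is illustrative.

```sql
-- Sketch: connections by node and state across the cluster.
SELECT
    nodeid,
    state,
    count(*) AS connections
FROM citus_stat_activity
WHERE pid <> pg_backend_pid()
GROUP BY nodeid, state
ORDER BY nodeid, connections DESC;
```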

### Vacuum and table stats

Keeping table statistics up to date helps improve query performance. Monitor whether regular autovacuuming is happening.

The following query helps you identify the tables that need vacuuming:

```sql
SELECT *
FROM run_command_on_all_nodes($$
    SELECT json_agg(t)
    FROM (
        SELECT schemaname, relname
            ,n_live_tup, n_dead_tup
            -- Cast to numeric: both counters are bigint, so plain division truncates to an integer.
            ,round(n_dead_tup::numeric / n_live_tup, 4) AS bloat
            ,last_autovacuum, last_autoanalyze
            ,last_vacuum, last_analyze
        FROM pg_stat_user_tables
        -- Example filter: adjust or remove the relname pattern for your tables.
        WHERE n_live_tup > 0 AND relname LIKE '%orders%'
        ORDER BY n_dead_tup DESC
    ) t
$$);
```

The following image highlights the output from the preceding query. The `result` column is a JSON data type containing information on the stats.

:::image type="content" source="./media/how-to-high-cpu-utilization-elastic-clusters/elastic-clusters-cpu-utilization-result.png" alt-text="Screenshot of the query results, including the result column as a JSON data type." lightbox="./media/how-to-high-cpu-utilization-elastic-clusters/elastic-clusters-cpu-utilization-result.png":::

The `last_autovacuum` and `last_autoanalyze` columns provide the date and time when the table was last autovacuumed or autoanalyzed. If the tables aren't autovacuumed regularly, take steps to tune autovacuum.
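
Before tuning autovacuum, it can be useful to review the settings currently in effect. The following is a minimal sketch that reads the standard `pg_settings` view on the node you're connected to; which parameters you then change, and to what values, depends on your workload.

```sql
-- Sketch: list the autovacuum-related settings in effect on this node.
SELECT name, setting, unit
FROM pg_settings
WHERE name LIKE 'autovacuum%'
ORDER BY name;
```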

The following query provides information about the amount of bloat at the schema level:

```sql
SELECT *
FROM run_command_on_all_nodes($$
    SELECT json_agg(t) FROM (
        SELECT schemaname, sum(n_live_tup) AS live_tuples
            , sum(n_dead_tup) AS dead_tuples
        FROM pg_stat_user_tables
        WHERE n_live_tup > 0
        GROUP BY schemaname
        ORDER BY sum(n_dead_tup) DESC
    ) t
$$);
```

## Resolve high CPU utilization

Use `EXPLAIN ANALYZE` to examine slow queries, and terminate transactions that run longer than they should. To further reduce CPU utilization, consider using the built-in PgBouncer connection pooler and clearing excessive bloat.

### Use EXPLAIN ANALYZE

After you identify the queries that consume the most CPU, use **EXPLAIN ANALYZE** to further investigate and tune them.

For more information about the **EXPLAIN ANALYZE** command, see its [documentation](https://www.postgresql.org/docs/current/sql-explain.html).
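
As an illustration, the following sketch runs `EXPLAIN` with the `ANALYZE` and `BUFFERS` options. The `orders` table and its `customer_id`, `order_date`, and `status` columns are hypothetical; substitute a statement that you found through `pg_stat_statements`. Because `EXPLAIN ANALYZE` actually executes the statement, the data-modifying example is wrapped in a transaction that is rolled back.

```sql
-- Hypothetical table and columns, used for illustration only.
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, count(*)
FROM orders
WHERE order_date >= now() - interval '7 days'
GROUP BY customer_id;

-- EXPLAIN ANALYZE runs the statement, so roll back data-modifying statements.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders
SET status = 'archived'
WHERE order_date < now() - interval '1 year';
ROLLBACK;
```

In the output, large gaps between estimated and actual row counts, or sequential scans over large tables, are common starting points for tuning.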

### Terminate long-running transactions on a node

Consider terminating a long-running transaction if the transaction runs longer than expected.

To terminate a session, first find its PID by using the following query:

```sql
SELECT
    pid,
    datname,
    usename,
    application_name,
    client_addr,
    backend_start,
    query_start,
    now() - query_start AS duration,
    state,
    wait_event,
    wait_event_type,
    query
FROM pg_stat_activity
-- The state filter already excludes idle sessions.
WHERE pid <> pg_backend_pid() AND state IN ('idle in transaction', 'active')
ORDER BY now() - query_start DESC;
```

You can also filter by other properties like `usename` (user name), `datname` (database name), and more.

After you get the session's PID, terminate it by using the following query:

```sql
-- Replace pid with the PID value returned by the preceding query.
SELECT pg_terminate_backend(pid);
```

Terminating the PID ends that specific session on the node that you're connected to.
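
A gentler first step, if you'd rather not drop the client connection, is to cancel only the running statement with the standard `pg_cancel_backend` function and terminate the session only if that doesn't help. The following sketch uses `12345` as a placeholder PID.

```sql
-- 12345 is a placeholder; use a PID returned by the pg_stat_activity query.
-- Cancel the current statement but keep the session connected.
SELECT pg_cancel_backend(12345);

-- If the session is still consuming CPU, end the session itself.
SELECT pg_terminate_backend(12345);
```

Both functions return `true` when the signal was sent, which doesn't by itself guarantee that the backend has already stopped.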

### Terminate long-running transactions on all nodes

Consider terminating a long-running transaction if the transaction runs longer than expected.

To terminate a session, first find its `pid` and `global_pid` by using the following query:

```sql
SELECT
    global_pid,
    pid,
    nodeid,
    datname,
    usename,
    application_name,
    client_addr,
    backend_start,
    query_start,
    now() - query_start AS duration,
    state,
    wait_event,
    wait_event_type,
    query
FROM citus_stat_activity
-- The state filter already excludes idle sessions.
WHERE pid <> pg_backend_pid() AND state IN ('idle in transaction', 'active')
ORDER BY now() - query_start DESC;
```

You can also filter by other properties like `usename` (user name), `datname` (database name), and more.

After you get the session's PID, terminate it by using the following query:

```sql
-- Replace pid with the pid value returned by the preceding query.
SELECT pg_terminate_backend(pid);
```

Terminating the `pid` ends that specific session on a worker node.

The same query running on different worker nodes might share the same `global_pid`. In that case, you can end the long-running transaction on all worker nodes by using the `global_pid`.

The following screenshot shows how `global_pid` values relate to session `pid` values.

:::image type="content" source="./media/how-to-high-cpu-utilization-elastic-clusters/global-pid-to-session-pid-example.png" alt-text="Screenshot that shows how global_pid values map to session pid values." lightbox="./media/how-to-high-cpu-utilization-elastic-clusters/global-pid-to-session-pid-example.png":::

```sql
-- Replace global_pid with the global_pid value returned by the preceding query.
SELECT pg_terminate_backend(global_pid);
```

> [!NOTE]
> To terminate long-running transactions automatically, set the server parameters `statement_timeout` or `idle_in_transaction_session_timeout`.

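
As a sketch of how these timeouts behave, they can be tried at the session level or for a single role before you change them server-wide. The values and the `app_user` role name below are assumptions chosen for illustration, and whether role-level settings are appropriate for your cluster is something to verify in your own environment.

```sql
-- Session-level: applies only to the current connection.
SET statement_timeout = '60s';                      -- assumed example value
SET idle_in_transaction_session_timeout = '5min';   -- assumed example value

-- Role-level: applies to new sessions of a hypothetical role named app_user.
ALTER ROLE app_user SET statement_timeout = '60s';

-- Confirm the value in effect for the current session.
SHOW statement_timeout;
```
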
## Clearing bloat

A short-term solution is to manually vacuum and then analyze the tables where slow queries appear:

```sql
VACUUM ANALYZE <table>;
```

## Managing connections

If your application uses many short-lived connections or many connections that stay idle for most of their life, consider using a connection pooler like PgBouncer.

## PgBouncer, a built-in connection pooler

For more information about PgBouncer, see [connection pooler](https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/not-all-postgres-connection-pooling-is-equal/ba-p/825717) and [connection handling best practices with PostgreSQL](https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/connection-handling-best-practice-with-postgresql/ba-p/790883).

Azure Database for PostgreSQL Elastic Clusters offer PgBouncer as a built-in connection pooling solution. For more information, see [PgBouncer](../connectivity/concepts-pgbouncer.md).

## Related content

- [Server parameters in Azure Database for PostgreSQL](../server-parameters/concepts-server-parameters.md)
- [Autovacuum tuning in Azure Database for PostgreSQL](how-to-autovacuum-tuning.md)

docfx.json

Lines changed: 1 addition & 6 deletions
@@ -270,14 +270,9 @@
 ]
 },
 "titleSuffix": {
+"articles/postgresql/**/*": "Azure Database for PostgreSQL",
 "articles/mysql/flexible-server/**/*.md": "Azure Database for MySQL",
-"articles/postgresql/scripts/**/*.md": "Azure Database for PostgreSQL",
-"articles/postgresql/flexible-server/**/*.md": "Azure Database for PostgreSQL",
-"articles/postgresql/migrate/**/*.md": "Azure Database for PostgreSQL",
 "articles/mysql/flexible-server/**/*.yml": "Azure Database for MySQL",
-"articles/postgresql/scripts/**/*.yml": "Azure Database for PostgreSQL",
-"articles/postgresql/flexible-server/**/*.yml": "Azure Database for PostgreSQL",
-"articles/postgresql/migrate/**/*.yml": "Azure Database for PostgreSQL",
 "articles/cosmos-db/**/*": "Azure Cosmos DB",
 "articles/cosmos-db/mongodb/**/*": "Azure Cosmos DB for MongoDB",
 "articles/cosmos-db/postgresql/**/*": "Azure Cosmos DB for PostgreSQL",
