Commit 89b91f7

Merge pull request #4573 from JaredMSFT/troubleshoot-slow-queries-ec

Troubleshoot Slow Queries When Using Elastic Clusters

2 parents f9bbe28 + 3b046a2 commit 89b91f7

4 files changed: +263 −0 lines changed
articles/postgresql/TOC.yml

Lines changed: 3 additions & 0 deletions

```diff
@@ -706,6 +706,9 @@
   - name: Troubleshoot and identify slow running queries
     href: troubleshoot/how-to-identify-slow-queries.md
     displayName: Troubleshoot and identify slow running queries
+  - name: Troubleshoot and identify slow running queries on elastic clusters
+    href: troubleshoot/how-to-identify-slow-queries-elastic-clusters.md
+    displayName: Troubleshoot and identify slow running queries on elastic clusters
   - name: Troubleshoot connections
     href: troubleshoot/how-to-troubleshoot-common-connection-issues.md
     displayName: Troubleshoot connections
```
Lines changed: 260 additions & 0 deletions
---
title: Identify Slow-Running Queries on Elastic Clusters
description: Troubleshooting guide for identifying slow-running queries in Azure Database for PostgreSQL elastic clusters.
author: GayathriPaderla
ms.author: gapaderla
ms.reviewer: jaredmeade, maghan
ms.date: 03/27/2026
ms.service: azure-database-postgresql
ms.subservice: performance
ms.topic: troubleshooting-general
---

# Troubleshoot and identify slow-running queries in Azure Database for PostgreSQL Elastic Clusters

This article describes how to identify and diagnose the root cause of slow-running queries. These queries can consume CPU resources and lead to high CPU utilization.

## Identify the slow query

Use `pg_stat_statements` to identify the slow query. The following query returns the five statements with the highest mean execution time.

```sql
SELECT userid::regrole, dbid, query, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 5;
```
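Optionally, if the accumulated statistics span a long period and mask recent behavior, you can reset them before reproducing the workload. This is a minimal sketch; resetting requires the appropriate privileges on the server.

```sql
-- Clear the accumulated pg_stat_statements counters so that
-- subsequent readings reflect only the current workload.
SELECT pg_stat_statements_reset();
```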

## Inspect current active or long-running queries

The following query helps you identify queries that have been running for more than 15 minutes.

```sql
SELECT
    global_pid,
    pid,
    nodeid,
    datname,
    usename,
    application_name,
    client_addr,
    backend_start,
    query_start,
    NOW() - query_start AS duration,
    state,
    wait_event,
    wait_event_type,
    query
FROM citus_stat_activity
WHERE state != 'idle'
  AND pid <> pg_backend_pid()
  AND state IN ('idle in transaction', 'active')
  AND NOW() - query_start > '15 minutes'
ORDER BY NOW() - query_start DESC;
```

:::image type="content" source="media/how-to-identify-slow-queries-elastic-clusters/long-running-queries.png" alt-text="Screenshot of long-running queries result." lightbox="media/how-to-identify-slow-queries-elastic-clusters/long-running-queries.png":::

The result shows one query on the server that runs slowly and has a long execution time.

The `global_pid` associated with the long-running query is the same across rows, which means the same query is the longest-running one on all the worker nodes.

### Identify the tables and their distribution type in the query

Determine which table types the query uses:

- Distributed tables
- Reference tables
- Colocated tables

If the query uses regular (local) tables, change them to either reference tables or colocated distributed tables. To find this information, use the following query.

```sql
SELECT table_name,
       distribution_type,
       distribution_column,
       shard_count,
       colocation_id
FROM citus_tables
ORDER BY table_name;
```

What to look for in the preceding query:

- `distribution_type = reference` → broadcast joins
- Missing or wrong `distribution_column`

### Solution

Changing a regular table to a reference table or a colocated distributed table reduces network activity between nodes.

```sql
SELECT create_reference_table('products');
```

## Detect non-colocated tables used in joins

Joins between tables that aren't colocated are one of the top causes of slow queries. Here's a query to identify non-colocated table pairs.

```sql
SELECT a.table_name AS table_a,
       b.table_name AS table_b,
       a.colocation_id AS colocation_a,
       b.colocation_id AS colocation_b
FROM citus_tables a
JOIN citus_tables b
    ON a.table_name <> b.table_name
WHERE a.colocation_id <> b.colocation_id;
```

What to look for in the preceding query:

- If your tables are listed, consider colocating them. Colocating tables prevents:
  - Data reshuffling across nodes
  - Network overhead
  - Temp file spills

You can also identify these symptoms by reviewing the execution plans of your query. Pay attention to these plan node types:

- Distributed Repartition Join
- Distributed Subplan/Union

### Solution

- Distribute tables on the join key.
- Make sure you join the distributed table and reference table correctly.
- Index the join keys.
- Fix colocation of the table by pointing the table to the right distribution key.
- You might need to undistribute the table and then redistribute it using a more appropriate distribution key.

```sql
SELECT undistribute_table('orders');
SELECT create_distributed_table('orders', 'customer_id');
```

## Check for skewness of data across shards and nodes

The following query identifies which shards and nodes contain long-running queries, and their shard sizes.

```sql
SELECT
    shardid,
    cs.shard_size/1024/1024 AS shard_size_mb,
    nodeid,
    nodename,
    global_pid,
    pid,
    state,
    query,
    NOW() - query_start AS duration
FROM citus_shards cs
JOIN citus_stat_activity
    ON citus_stat_activity.query LIKE '%' || cs.shardid || '%'
   AND pid <> pg_backend_pid()
   AND state IN ('idle in transaction', 'active')
   AND NOW() - query_start > '15 minutes'
ORDER BY duration DESC;
```

The results show that the queries access four specific shards in each of the worker nodes.

If you see the majority of data on a subset of worker nodes, reconsider your distribution key selection.

To troubleshoot further, review details of the distributed table by shards using the following query:

```sql
SELECT *
FROM run_command_on_shards('orders', $$
    SELECT json_build_object(
        'shard_name', '%1$s',
        'size', pg_size_pretty(pg_table_size('%1$s'))
    );
$$);
```

### Solution

Based on the preceding output, if the data is skewed to a few shards, the distribution key is likely the cause. Consider rearchitecting the distribution key.

Here's a related talk on choosing the right shard key: [Efficiently distributing Postgres with Citus - How to choose the right shard key? | Citus Con 2022](https://www.youtube.com/watch?v=t0EXeWk3lAk).

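If the imbalance comes from uneven shard placement across nodes rather than from a skewed distribution key, rebalancing the shards can help. This is a hedged sketch that assumes a Citus version providing the background rebalancer; it doesn't help when a single shard is hot because of the key itself.

```sql
-- Start a background shard rebalance to even out data across worker nodes.
-- Assumes the citus_rebalance_start() function is available on this cluster.
SELECT citus_rebalance_start();
```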
## Diagnose lock contention

Check for locking and blocking by using the following query.

```sql
SELECT
    lw.waiting_gpid AS blocked_gpid,
    lw.blocking_gpid AS blocking_gpid,
    wa.query AS blocked_query,
    wa.state AS blocked_state,
    wa.wait_event AS blocked_wait_event,
    wa.wait_event_type AS blocked_wait_event_type,
    NOW() - wa.query_start AS blocked_duration,
    ba.query AS blocking_query,
    ba.state AS blocking_state,
    ba.wait_event AS blocking_wait_event,
    ba.wait_event_type AS blocking_wait_event_type,
    lw.waiting_nodeid,
    lw.blocking_nodeid
FROM citus_lock_waits lw
LEFT JOIN citus_stat_activity wa ON lw.waiting_gpid = wa.global_pid
LEFT JOIN citus_stat_activity ba ON lw.blocking_gpid = ba.global_pid
ORDER BY blocked_duration DESC NULLS LAST;
```

### Solution

Terminate the blocking process by using the following command. Replace `blocking_gpid` with the value returned by the preceding query.

```sql
SELECT pg_terminate_backend(blocking_gpid);
```

## Check for bloat in the tables involved in the slow query

To see vacuum statistics details, run the following query:

```sql
SELECT *
FROM run_command_on_all_nodes($$
    SELECT json_agg(t)
    FROM (
        SELECT *
        FROM pg_stat_user_tables
        WHERE relname LIKE '%orders%'
        ORDER BY n_dead_tup DESC
        LIMIT 5
    ) t
$$);
```

This query returns output in the following format. The result contains a JSON column with the statistics for each matching table.

:::image type="content" source="media/how-to-identify-slow-queries-elastic-clusters/bloat.png" alt-text="Screenshot of bloat check query result." lightbox="media/how-to-identify-slow-queries-elastic-clusters/bloat.png":::

### Solution

If the `n_dead_tup` to `n_live_tup` ratio is high, run `VACUUM` on the table.

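As a minimal sketch, using the `orders` table from the earlier examples; for a distributed table, Citus typically propagates the command to the shards on the worker nodes.

```sql
-- Reclaim dead tuples and refresh planner statistics in one pass.
VACUUM (ANALYZE, VERBOSE) orders;
```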
## Check the query plan for missing indexes

Get the query plan by running the following command:

```sql
EXPLAIN (ANALYZE, BUFFERS) <query>;
```

Look for sequential scan nodes in the query plan and the number of rows they process. If a sequential scan processes many rows and accounts for most of the execution time, consider adding indexes.

### Solution

Add appropriate indexes to the table to improve the query performance.

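As an illustrative sketch, the index name and column below are hypothetical; match them to the columns your slow query filters or joins on. For a distributed table, Citus creates the index on every shard.

```sql
-- Hypothetical example: index the column the slow query filters on.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
```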
## Check for cache and I/O efficiency

To check the cache hit rate, use the following query.

```sql
SELECT *
FROM run_command_on_all_nodes($$
    SELECT json_agg(t)
    FROM (
        SELECT sum(heap_blks_read) AS reads,
               sum(heap_blks_hit) AS hits,
               100 * sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS cache_hit_rate
        FROM pg_statio_user_tables
    ) t
$$);
```

## Check the index cache hit rate

To check the index cache hit rate, use the following query.

```sql
SELECT *
FROM run_command_on_all_nodes($$
    SELECT json_agg(t)
    FROM (
        SELECT sum(idx_blks_read) AS index_reads,
               sum(idx_blks_hit) AS index_hits,
               100 * sum(idx_blks_hit) / (sum(idx_blks_hit) + sum(idx_blks_read)) AS index_cache_hit_rate
        FROM pg_statio_user_indexes
    ) t
$$);
```

> [!NOTE]
> A low cache hit rate can occur when your server restarts or scales. In those cases, wait for your system to stabilize.

## Related content

- [Troubleshoot high CPU utilization in Azure Database for PostgreSQL](how-to-high-cpu-utilization.md)
- [Troubleshoot high IOPS utilization in Azure Database for PostgreSQL](how-to-high-io-utilization.md)
- [Troubleshoot high memory utilization in Azure Database for PostgreSQL](how-to-high-memory-utilization.md)
- [Server parameters in Azure Database for PostgreSQL](../server-parameters/concepts-server-parameters.md)
Binary files changed: 60.9 KB and 668 KB.
