---
title: Identify Slow-Running Queries on Elastic Clusters
description: Troubleshooting guide for identifying slow-running queries in Azure Database for PostgreSQL elastic clusters.
author: GayathriPaderla
ms.author: gapaderla
ms.reviewer: jaredmeade, maghan
ms.date: 03/27/2026
ms.service: azure-database-postgresql
ms.subservice: performance
ms.topic: troubleshooting-general
---

# Troubleshoot and identify slow-running queries in Azure Database for PostgreSQL elastic clusters

This article describes how to identify and diagnose the root cause of slow-running queries. Such queries can consume excessive CPU and drive up overall CPU utilization.

## Identify the slow query

Use `pg_stat_statements` to identify slow queries. The following query returns the five statements with the highest mean execution time.

```sql
SELECT userid::regrole, dbid, query, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 5;
```
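
After you tune a query, you can optionally clear the collected statistics so that the next measurement reflects only new activity. This sketch assumes the `pg_stat_statements` extension is already enabled on your server:

```sql
-- Reset statistics gathered by pg_stat_statements so future
-- measurements reflect only post-tuning activity.
SELECT pg_stat_statements_reset();
```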

## Inspect current active or long-running queries

The following query identifies queries that have been running for more than 15 minutes.

```sql
SELECT
    global_pid,
    pid,
    nodeid,
    datname,
    usename,
    application_name,
    client_addr,
    backend_start,
    query_start,
    NOW() - query_start AS duration,
    state,
    wait_event,
    wait_event_type,
    query
FROM citus_stat_activity
WHERE pid <> pg_backend_pid()
AND state IN ('idle in transaction', 'active')
AND NOW() - query_start > '15 minutes'
ORDER BY NOW() - query_start DESC;
```

:::image type="content" source="media/how-to-identify-slow-queries-elastic-clusters/long-running-queries.png" alt-text="Screenshot of long-running queries result." lightbox="media/how-to-identify-slow-queries-elastic-clusters/long-running-queries.png":::

The result shows one query on the server that runs slowly and takes a long time to execute.

The `global_pid` associated with the long-running query is the same in every row, which means the same distributed query runs the longest on all the worker nodes.

### Identify the tables and their distribution type in the query

Determine which kinds of tables the query touches:

- The distributed tables
- The reference tables
- The colocated tables

If the query uses regular (local) tables, consider converting them to reference tables or colocated distributed tables. To find this information, use the following query.

```sql
SELECT table_name,
       distribution_type,
       distribution_column,
       shard_count,
       colocation_id
FROM citus_tables
ORDER BY table_name;
```

What to look for in the preceding query:

- `distribution_type = reference` → these tables are replicated to every node and support broadcast joins
- A missing or incorrect `distribution_column`

### Solution

Converting a regular table to a reference table, or colocating it with related tables, reduces network traffic between nodes.

```sql
SELECT create_reference_table('products');
```
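
Alternatively, if the table should be distributed rather than replicated, you can colocate it with a related table by using the `colocate_with` option. The table and column names below (`line_items`, `order_id`, `orders`) are hypothetical examples:

```sql
-- Distribute the table on the join key and colocate its shards
-- with the shards of a related table (hypothetical names).
SELECT create_distributed_table('line_items', 'order_id', colocate_with => 'orders');
```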

## Detect non-colocated tables used in joins

One of the most common causes of slow queries is a join between tables that aren't colocated. Here's a query to identify non-colocated table pairs.

```sql
SELECT a.table_name AS table_a,
       b.table_name AS table_b,
       a.colocation_id AS colocation_a,
       b.colocation_id AS colocation_b
FROM citus_tables a
JOIN citus_tables b
  ON a.table_name < b.table_name
WHERE a.colocation_id <> b.colocation_id;
```

What to look for in the preceding query:

- If your tables are listed, consider colocating them. Colocating tables prevents:
  - Data reshuffling across nodes
  - Network overhead
  - Temporary file spills

You can also identify these symptoms by reviewing the execution plans of your query. Pay attention to these plan node types:

- Distributed Repartition Join
- Distributed Subplan/Union

### Solution

- Distribute tables on the join key.
- Make sure distributed-to-distributed joins use the distribution column, and use reference tables for small, frequently joined tables.
- Index the join keys.
- Fix the table's colocation by distributing it on the right key.
  - You might need to undistribute the table and then redistribute it with a more appropriate distribution key.

```sql
SELECT undistribute_table('orders');
SELECT create_distributed_table('orders', 'customer_id');
```
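
To index the join keys, create the index on the coordinator; for distributed tables, Citus propagates the statement to every shard. The index, table, and column names below are hypothetical:

```sql
-- Index the join/distribution key so colocated joins can use index scans.
-- Run on the coordinator; Citus creates the index on all shards.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
```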

## Check for skewness of data across shards and nodes

The following query identifies which shards and nodes the long-running queries touch, along with the shard sizes.

```sql
SELECT
    shardid,
    cs.shard_size/1024/1024 AS shard_size_mb,
    nodeid,
    nodename,
    global_pid,
    pid,
    state,
    query,
    NOW() - query_start AS duration
FROM citus_shards cs
JOIN citus_stat_activity
  ON citus_stat_activity.query LIKE '%' || cs.shardid || '%'
 AND pid <> pg_backend_pid()
 AND state IN ('idle in transaction', 'active')
 AND NOW() - query_start > '15 minutes'
ORDER BY duration DESC;
```

The results show that the queries access four specific shards on each of the worker nodes.

If the majority of the data sits on a subset of worker nodes, reconsider your choice of distribution key.

To troubleshoot further, review the per-shard details of the distributed table by using the following query:

```sql
SELECT * FROM run_command_on_shards('orders', $$
    SELECT json_build_object(
        'shard_name', '%1$s',
        'size', pg_size_pretty(pg_table_size('%1$s'))
    );
$$);
```
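
You can also compare row counts per shard to spot skew. A minimal sketch, assuming a hypothetical `orders` table:

```sql
-- Count rows in every shard of the distributed table.
-- run_command_on_shards substitutes %1$s with each shard's name.
SELECT * FROM run_command_on_shards('orders', $$ SELECT count(*) FROM %1$s $$);
```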

### Solution

Based on the preceding output, if the data is skewed toward a few shards, the distribution key is likely the cause. Consider choosing a different distribution key.

Here's a related talk on choosing the right shard key: [Efficiently distributing Postgres with Citus - How to choose the right shard key? | Citus Con 2022](https://www.youtube.com/watch?v=t0EXeWk3lAk).

## Diagnose lock contention

Check for locking and blocking by using the following query.

```sql
SELECT
    lw.waiting_gpid AS blocked_gpid,
    lw.blocking_gpid AS blocking_gpid,
    wa.query AS blocked_query,
    wa.state AS blocked_state,
    wa.wait_event AS blocked_wait_event,
    wa.wait_event_type AS blocked_wait_event_type,
    NOW() - wa.query_start AS blocked_duration,
    ba.query AS blocking_query,
    ba.state AS blocking_state,
    ba.wait_event AS blocking_wait_event,
    ba.wait_event_type AS blocking_wait_event_type,
    lw.waiting_nodeid,
    lw.blocking_nodeid
FROM citus_lock_waits lw
LEFT JOIN citus_stat_activity wa ON lw.waiting_gpid = wa.global_pid
LEFT JOIN citus_stat_activity ba ON lw.blocking_gpid = ba.global_pid
ORDER BY blocked_duration DESC NULLS LAST;
```

### Solution

Terminate the blocking session by passing its `blocking_gpid` value to the following command:

```sql
SELECT pg_terminate_backend(blocking_gpid);
```

## Check for bloat in the tables involved in the slow query

To see vacuum statistics details, run the following query:

```sql
SELECT * FROM run_command_on_all_nodes( $$ SELECT json_agg(t) FROM (
    SELECT * FROM pg_stat_user_tables WHERE relname LIKE '%orders%' ORDER BY n_dead_tup DESC LIMIT 5
) t $$ );
```

This query provides the output in the following format. The result contains a JSON column with all the statistics information for the table.

:::image type="content" source="media/how-to-identify-slow-queries-elastic-clusters/bloat.png" alt-text="Screenshot of bloat check query result." lightbox="media/how-to-identify-slow-queries-elastic-clusters/bloat.png":::

### Solution

If the `n_dead_tup/n_live_tup` ratio is high, run `VACUUM` on the table.
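
For example, for a hypothetical `orders` table (for distributed tables, Citus propagates the command to the shards):

```sql
-- Reclaim dead tuples and refresh planner statistics.
VACUUM (ANALYZE, VERBOSE) orders;
```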

## Check the query plan for missing indexes

Get the query plan by running the following command:

```sql
EXPLAIN (ANALYZE, BUFFERS) <query>;
```

Look for sequential scan nodes in the query plan and the number of rows they process. If the row count is high and the scan accounts for most of the execution time, consider adding indexes.

### Solution

Add appropriate indexes to the table to improve query performance.
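
For example, to index a frequently filtered column without blocking writes, you can build the index concurrently. The index, table, and column names below are hypothetical:

```sql
-- Build the index without taking a write lock on the table.
-- Run on the coordinator; Citus creates it on each shard.
CREATE INDEX CONCURRENTLY idx_orders_order_date ON orders (order_date);
```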

## Check for cache and I/O efficiency

To check the table cache hit rate, use the following query.

```sql
SELECT * FROM run_command_on_all_nodes( $$ SELECT json_agg(t) FROM (
    SELECT sum(heap_blks_read) AS reads, sum(heap_blks_hit) AS hits, 100 * sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS cache_hit_rate
    FROM pg_statio_user_tables
) t $$ );
```

## Check the index cache hit rate

To check the index cache hit rate, use the following query.

```sql
SELECT * FROM run_command_on_all_nodes( $$ SELECT json_agg(t) FROM (
    SELECT sum(idx_blks_read) AS index_reads, sum(idx_blks_hit) AS index_hits, 100 * sum(idx_blks_hit) / (sum(idx_blks_hit) + sum(idx_blks_read)) AS index_cache_hit_rate
    FROM pg_statio_user_indexes
) t $$ );
```

> [!NOTE]
> A low cache hit rate can occur after your server restarts or scales. In those cases, wait for your system to stabilize and warm the cache.
## Related content

- [Troubleshoot high CPU utilization in Azure Database for PostgreSQL](how-to-high-cpu-utilization.md)
- [Troubleshoot high IOPS utilization in Azure Database for PostgreSQL](how-to-high-io-utilization.md)
- [Troubleshoot high memory utilization in Azure Database for PostgreSQL](how-to-high-memory-utilization.md)
- [Server parameters in Azure Database for PostgreSQL](../server-parameters/concepts-server-parameters.md)