Andrew C. Oliver
Contributing Writer

9 crushing performance problems in scalable systems

analysis
Jul 27, 20176 mins
DatabasesJavaNoSQL Databases

We have no shortage of ways to slow complex systems to a crawl. Fix these nine, and a tenth will be along soon

If you have deployed a few systems of scale, you know that some design problems are worse than others. Itโ€™s one thing to write tight code, and another thing to avoid introducing performance-crushing design flaws into the system.

Here are nine common problems โ€“ poor design choices, really โ€“ that will cause your system to spin its wheels, or even turn against itself. Unlike many bad decisions, these can be reversed.

1. N+1 queries

If you select all of a customerโ€™s orders in one query then loop through selecting each orderโ€™s line items in a query per order, thatโ€™s n trips to the database plus one. One big query with an outer join would be more efficient. If you need to pull back fewer at a time you can use a form of paging. Developers using caches that fill themselves often write n+1 problems by accident. You can find these situations with database monitoring tools such as Oracle Enterprise Monitor (OEM) or APM tools such as Wily Introscope or just plain query logging. There are worse versions of this problem such as people who try and crawl a tree stored in flat tables instead of using CTEs. There are also equivalent versions of these problems in NoSQL databases, so no one is safe.

2. Page or record locking

Man, I used to loathe DB2 and SQL Server. I still loathe their default locking model. Depending on platform, DB2 locks a โ€œpageโ€ of records for updates. You end up locking records that arenโ€™t even related to what youโ€™re doing just by happenstance. Row locks are more common. A longer-running transaction makes a minor update to a row that doesnโ€™t really affect anything else. All other queries block. Meanwhile those transactions hold any locks they have longer, creating a cascading performance problem. If youโ€™re on either of those databases, turn on and design for snapshot isolation. Oracle uses a form of snapshot isolation by default. There are NoSQL databases that can be configured for paranoid levels of consistency. Understand its locking model before you hurt yourself.

3. Thread synchronization

This problem comes in many forms. Sometimes itโ€™s hidden in a library. Years ago XML parsers used to validate the MIME type using a Java library called the Bean Activation Framework that used the old Java โ€œHashtableโ€ collection that synchronized every method. That meant all threads doing XML parsing eventually queued in the same place, creating a massive concurrency bottleneck. You can find this problem by reading thread dumps. A lot of modern developers have gotten used to tools that handle most threading for them. This is fine until something doesnโ€™t work properly. Everyone needs to understand concurrency.

4. Database sequences

If you just need a unique ID, donโ€™t use a sequence. Only use a sequence if you legitimately need each ID to be wellโ€ฆ sequential. How are sequences implemented? Thread locks. That means everything going for that sequence blocks. As an alternative use a randomly generated UUID that uses a secure random algorithm. Although it is theoretically possible to get a duplicate, after generating trillions of rows you still have a better chance of getting hit in the head with a meteorite. Iโ€™ve had developers actually sit in front of me bareheaded with no meteorite helmet and tell me that not even a theoretical chance of duplicates was acceptable in their system but I guess they didnโ€™t value their own life as much. At least wear foil, sheesh.

5. Opening connections

Whether it be database, HTTP, or whatever, pool that stuff. Also on larger systems donโ€™t try and open them all at once because youโ€™ll find out your database wasnโ€™t designed to do that!

6. Swapping

You need more memory than you can hope to use. If you use any swap at all, thatโ€™s bad. I used to configure my Linux boxes with no swap enabled because I wanted them to just crash and burn rather than silently kill my software.

7. I/O synchronization

Most caching software has the capacity to do โ€œwrite behindโ€ where data is written to memory on at least two machines and life goes on rather than waiting for disk to catch up. This โ€œsoftensโ€ the read-write wave. Eventually if write throughput is high enough youโ€™ll have to block to catch up before the write-behind cache blows up. I/O sync exists elsewhere too, in things like log files and really anywhere you persist anything. Some software still calls fsync a lot and thatโ€™s not something you want in your high-end distributed scalable software โ€“ at least not with a lot of help.

8. Process spawning

Some software packages, especially on Unix operating systems, are still designed around processes and subprocesses where each process has a single thread. You can pool these and reuse them or other bandaging but this just isnโ€™t a good design. Each process and child process gets its own memory space. Sometimes you can allocate another horrible idea called shared memory. Modern software has processes that manage multiple threads. This scales much better on modern multicore CPUs where each core can often handle multiple threads concurrently as well.

9. Network contention

Supposedly a distributed filesystem and Spark for in-memory computing make all of your server nodes work together and life is just grand, right? Actually, you still have NIC cards and switches and other things that constrain that bandwidth. NICs can be bonded, and switches are rated for certain numbers of packets per second (this is different than, say, a 1G switch that may not deliver 1G on all 20 ports). So itโ€™s great you have lots of nodes delivering your data in bursts but are they bottlenecking on their own NICs, your actual network bandwidth, or on your switchโ€™s actual available throughput?

Hopefully if youโ€™re out there coding and architecting systems, youโ€™re avoiding these pitfalls. Otherwise, remember that some old timers leave the industry and open bars, so there is always that.