Data Generator

My colleague and friend Mauricio (@mvgiacomello) shared a useful tip on Data Generation.

There is a web application that can generate test data based on a set of parameters. It can export the data on HTML, Excel, CSV, SQL and XML.

Their website has a demo that can generate up to 200 records. If you need more, you can download the code and deploy to your own server!

Here is what they say:

Ever needed custom formatted sample / test data, like, bad? Well, that’s the idea of the Data Generator. It’s a free, open source script written in JavaScript, PHP and MySQL that lets you quickly generate large volumes of custom data in a variety of formats for use in testing software, populating databases, and scoring with girls.

You can check it out at: http://www.generatedata.com/#generator

Performance Anti-Patterns in Database-Driven Applications

Nearly every modern application relies on databases for data persistence. The database access layer is very often responsible for serious performance problems. In the case of database problems most people start searching in the database itself. Appropriate indexes and database structures are vital for achieving adequate performance. Often, however, the application layer is responsible for poor performance or scalability problems.

The application layer controls and drives database access. Problems at this layer cannot be compensated in the database itself. Therefore the design of adequate data-access logic is vital for achieving performance and scalability. While there a nearly endless different use cases for database-driven applications, the problems can be nailed down to a small set of anti-patterns. Analyzing whether your application implements the following anti-patterns and resolving them will help to easily implement faster and more scalable software with minimal additional effort.

Misuse of O/R Mappers

O/R mappers have become a central part in modern database applications. O/R mappers take away the burden of translating and accessing relational data from object-oriented software. They hide great parts of the complexity of data access from the application programmer. This results in higher productivity as the developer can concentrate on the actual application logic rather than infrastructural details. Complex data graphs can be easily navigated at the object-relational layer without seeing what is going on under the hood. This often creates the wrong impression that these frameworks take away the burden of designing data-access logic.

Often developers think that their data-access framework will simply do things right; however, using O/R mapping frameworks without understanding their inner workings in many cases results in poor application performance. There are two central misunderstandings that cause these problems – loading behavior and load time.

O/R mappers load data on a per-object base. This means when an object is requested or accessed the necessary SQL statements are created and executed. This principle is very generic and at first sight works well in most situations. At the same time it is very often the source of performance and scalability problems.

Let’s take a simple example. In a database for storing address information, we have one table for persons and one for addresses. If we want to get the name for each person and the city they live in, we have to iterate over the persons and then access the address information. The image below shows the result if the out-of-the box query mechanisms are used. This simple use case results in a high number of database queries.

This directly brings up the second important detail of O/R mappers – load time. O/R mappers – if not told otherwise – try to load data as late as possible. This behaviour is referred to as lazy loading. Lazy loading ensures that data is loaded as late as possible with the goal to perform as few database queries as possible and at the same time avoid unnecessary creation of objects. While this approach is generally feasible, it may result in serious performance problems and so-called LazyLoadingExceptions on accessing data that has not been loaded when no database connection is present.

In situations like the one described above data loading and at the same time performance can be significantly improved by using specialized data queries.

So while O/R mappers can be of great help in the development of data access they still leave the burden of designing proper data access logic. Dynamic architecture validation with tools such as dynaTrace can be of great help here to identify performance weak points in the application and proactively resolve them.

Load More Data Then Needed

Another anti-pattern in database access that can often be found is that much more data is loaded that actually needed. There are a number of reasons for this. Rapid Application development tools provide easy ways of linking data structures to user interface controls. As the data layer is built of domain objects, they very often contain much more data than is actually visualized. The example uses the address book scenario again. This time the names of the persons and their home cities are visualized. Instead of just loading these three items both objects – addresses and persons – are loaded. This results in massive overhead at the database, network and application level. The usage of specific queries can help to massively reduce the amount of queried data. This performance improvement however comes along with additional effort for maintenance. Adding a new column to the table might require several changes to the data access layer.

This anti-pattern can also be found very often in case of improperly designed service interfaces. Service interfaces are often designed to be generic enough to support a large number of use cases. This has the advantage that services have small contracts which can be used in a wide variety of use cases. Additionally uses cases change faster than the backend service implementations. This can result in services interfaces being inappropriate for certain scenarios. Developers will then have to use workarounds which might result in highly inefficient data access logic. This problem often arises in data-driven Web Services.

In order to overcome these problems data access patterns should be continuously analyzed during development. In the case of agile development approaches, data access logic should be checked for each finished user story. Additionally data access patterns should also be analyzed across application use cases to understand data access logic to be able to optimize data access logic according during development.

Inadequate Usage of Resources

Databases are a bottleneck for resources in applications, so they should be used as little as possible. Very often too little attention is paid to the usage of database connections. As with any shared resource such connections massively affect overall system performance. Specifically, web applications and applications using O/R mapping frameworks with lazy initialization tend to keep database connections longer than needed. Connections are acquired at the beginning of processing and kept until rendering is finished or no further data access is required. In applications using O/R mappers, they are often kept to avoid nasty lazy initialization problems. By redesigning data access logic and separating it from post-processing (like rendering), the performance and scalability of an application can be dramatically improved.

The chart below shows the response time of ten concurrent data processing threads. In the first part one database connection is used. In the second scenario ten connections are used. In the third scenario two database connections are used but two thirds of the processing is performed after having returned the connection. With better designed data access the third scenario nearly achieves the same performance with a tenth of the resources.

One Bunch of Everything

One Bunch of Everything is an anti-pattern that can generally be observed in development but even more in agile teams. The characteristic of this anti-pattern is that primarily features are developed and all data access is treated equally, as if there would not be any differences. However treating different types of data and queries differently can significantly improve application performance and scalability.

Data should be analyzed regarding its lifetime characteristics. How often does it change or if it is modified or only read. Access frequency of data, together with access patterns, provides hints on potential sources for caching. Access frequency also provides hints as to where optimizations make the most sense. This avoids premature and unnecessary optimization and guarantees the highest impact of performance tuning.

Analyzing usage patterns of data also helps to tune the data access layer. Understanding which data is really used helps to optimize loading strategies. Understanding how users browse search results, for example, helps to optimize fetch sizes. Knowing whether users look at order details helps to select lazy or eager loading for order positions.

In addition to data, queries should also be analyzed and categorized. Important factors are query duration, execution frequency and whether they are used in an interactive user context or batch-processing scenario. Transactional characteristics further help to fine tune isolation levels of queries.

Running short-running interactive queries of users and long-running reporting queries on the same connection for example may easily result in bad end user experience. Long-running reporting queries can greedily acquire database connections leaving end-user queries starving. Using different connection pools for different query types results in much more predicable end user performance. Softening isolation level on database queries where they are not required can also lead to significantly improved performance and scalability.

Bad Testing

Finally, missing or improper testing is one of the major reasons for performance and scalability problems in database-accessing applications. I recently gave a talk on this topic and asked the audience whether they see database access as a performance problem in their applications. While all of them agreed, nobody had testing procedures in place to test data access performance. So while it seems to be an important topic, people do not seem to invest in it.

However, even if testing procedures are in place, this does not necessarily mean that testing is done correctly. Although a lot of problems in data access logic can be found right after the code has been developed, testing is performed much later in the load testing phase. This introduces unnecessarily high costs as changes are performed late in the lifecycle possibly requiring architectural changes leading to additional development and testing efforts.

Furthermore test cases have to be designed to test real world data access scenarios. Data access has to be tested in a concurrent mode and using different access types. Only combined read/write access helps to identify locking and concurrency problems. Additionally adequate variation of input data is required to avoid unrealistically frequent cache hits.

Very often people also do not know for which load to test as they have no adequate information on expected load. In my experience this is very often the case – unfortunately. This, however, is not an excuse for not defining load and performance criteria. It is still better to have some criteria defined instead of not defining them at all.

In case you really have no clue on performance characteristics the best approach is to use load testing criteria with increasing load until the saturation point of the application is reached. Then you have identified the peak load of the application. If this sounds reasonable and realistic you are on a good way. Otherwise you know where you have to improve performance. In most cases initial tests show that application can cope with much less load as expected.

Conclusion

Database access is one of the most critical areas impacting performance and scalability in modern applications. While frameworks support in building data access logic, a serious amount of thought still has to be put into the design of data access logic to avoid pitfalls and problems. The key is to understand the details of the dynamics and characteristics of an application’s data-access layer.

Via InfoQ