Search Performance Optimizations

Jun 10, 2020

Searches in Panther Enterprise are now 10x faster, and log data uses 60% less storage, thanks to Automatic Log Compaction and Columnar Data Loading.

Why?

Panther Community stores all log data as gzip-compressed JSON files in AWS S3. While JSON is very flexible, it requires complex parsing and consumes considerably more storage space than binary file formats. At large volumes, JSON files can be slow to search and expensive to store.

In Panther Community’s architecture, JSON files are used directly by Amazon Athena for historical queries. This has two notable limitations:

Athena searches frequently fail when queries run across large numbers of files; and
The JSON files need to be parsed, which can result in slow searches.

How does it work?

To overcome these limitations, Panther Enterprise coalesces log files into an optimal number of files per hourly partition. This results in fewer queries and faster searches.
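As a rough illustration of the coalescing idea, the sketch below groups small log files by hourly partition and packs each partition's files into batches, where each batch would become one compacted file. This is not Panther's implementation: the partition key format, the 128 MB target size, and the function names are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Assumed target size for one compacted file (hypothetical value).
TARGET_BYTES = 128 * 1024 * 1024


def hour_partition(ts):
    """Return a hypothetical hourly partition key for a timestamp."""
    return ts.strftime("year=%Y/month=%m/day=%d/hour=%H")


def plan_compaction(files, target_bytes=TARGET_BYTES):
    """Group files by hourly partition, then pack each partition's files
    into batches of at most roughly target_bytes.

    `files` is an iterable of (name, timestamp, size_in_bytes) tuples.
    Returns {partition_key: [[file names in batch 1], [batch 2], ...]}.
    """
    by_partition = defaultdict(list)
    for name, ts, size in files:
        by_partition[hour_partition(ts)].append((name, size))

    plan = {}
    for partition, entries in by_partition.items():
        batches, current, current_size = [], [], 0
        for name, size in sorted(entries):
            # Close the current batch once adding this file would exceed the target.
            if current and current_size + size > target_bytes:
                batches.append(current)
                current, current_size = [], 0
            current.append(name)
            current_size += size
        if current:
            batches.append(current)
        plan[partition] = batches
    return plan


if __name__ == "__main__":
    mb = 1024 * 1024
    demo = [
        (f"log-{i}.json.gz", datetime(2020, 6, 10, 14, i, tzinfo=timezone.utc), 50 * mb)
        for i in range(6)
    ]
    for partition, batches in plan_compaction(demo).items():
        print(partition, "->", len(batches), "compacted files")
```

The point of the sketch is the shape of the result: many small inputs per hour collapse into a handful of outputs, so a query over that hour opens far fewer objects.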

JSON to Parquet

Panther Enterprise converts log data from JSON to Parquet. With Parquet’s binary column-oriented format, Athena only reads data from the columns being queried. In contrast, with JSON, Athena needs to read a full record to select the columns in the query. Moreover, Parquet typically results in ~60% smaller file sizes, meaning you also pay less for storage.
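To make the difference concrete, here is a toy comparison of the two layouts using only the Python standard library. It does not use Parquet itself (real Parquet adds binary encodings, compression, and file metadata); it only mimics the idea that a columnar file lets a reader fetch one column without touching the rest. The field names and sample values are invented for the example.

```python
import json

# Invented sample records; field names are illustrative only.
records = [
    {
        "eventTime": f"2020-06-10T14:00:{i:02d}Z",
        "sourceIPAddress": f"10.0.0.{i}",
        "eventName": "GetObject",
        "userAgent": "aws-cli/2.0.0",
    }
    for i in range(60)
]


def row_layout(records):
    """One JSON document per record, as in newline-delimited JSON logs."""
    return [json.dumps(r).encode() for r in records]


def column_layout(records):
    """One independently readable chunk per column, mimicking a columnar file."""
    return {
        key: json.dumps([r[key] for r in records]).encode()
        for key in records[0]
    }


def select_from_rows(rows, column):
    """Every full record must be read and parsed to extract a single field."""
    bytes_read = sum(len(row) for row in rows)
    values = [json.loads(row)[column] for row in rows]
    return values, bytes_read


def select_from_columns(columns, column):
    """Only the chunk for the requested column is read."""
    chunk = columns[column]
    return json.loads(chunk), len(chunk)


if __name__ == "__main__":
    _, row_bytes = select_from_rows(row_layout(records), "sourceIPAddress")
    _, col_bytes = select_from_columns(column_layout(records), "sourceIPAddress")
    print("bytes read, row layout:   ", row_bytes)
    print("bytes read, column layout:", col_bytes)
```

Both functions return the same values; only the number of bytes that had to be read differs, which is the quantity Athena bills on.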

Why Column-Oriented?

While row-oriented databases are efficient for transactional data stores, the Panther data warehouse is built for real-time analytics, where column-oriented data storage allows for greater efficiency.

Modern data warehouses store a ton of information. Every single row of data may have hundreds of associated columns. In a column-based approach, query processing doesn’t need to parse the columns that are unnecessary for operator evaluation. Additionally, column-oriented databases offer better compression ratios and greater utilization of parallel processing capabilities, leading to significant performance and cost advantages for your security analytics.

How does this impact you?

Query failure rates will decrease: With smaller file sizes and more efficient searches, queries in Panther Enterprise will perform better compared to the same queries in Panther Community.
Queries are cheaper: One facet of AWS billing is the amount of data scanned. Less data is scanned with column stores, meaning lower bills!
Queries are faster: Column-oriented data stores result in queries that are 10x faster.
Storage is more efficient: Compacted Parquet files take up less space in S3 than gzipped JSON.
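A back-of-the-envelope estimate shows how the storage and scan savings combine. The sketch below assumes a price of $5 per TB scanned (an assumption; check current Athena pricing for your region), the ~60% size reduction quoted above, and a hypothetical query that touches 5 of 100 equally sized columns. Real columns are rarely equal in size, so treat the result as an order-of-magnitude illustration.

```python
TB = 1024 ** 4  # bytes in one tebibyte


def scan_cost_usd(bytes_scanned, usd_per_tb=5.0):
    """Cost of a query billed by data scanned (price per TB is an assumption)."""
    return bytes_scanned / TB * usd_per_tb


def parquet_bytes_scanned(json_bytes, size_reduction=0.60,
                          columns_queried=5, total_columns=100):
    """Estimate bytes scanned after conversion to a columnar format.

    Assumes the files shrink by `size_reduction` and that every column
    occupies the same share of the file (a simplification).
    """
    parquet_bytes = json_bytes * (1.0 - size_reduction)
    return parquet_bytes * columns_queried / total_columns


if __name__ == "__main__":
    json_bytes = 1 * TB  # a query over 1 TB of gzipped JSON scans all of it
    print(f"JSON scan cost:    ${scan_cost_usd(json_bytes):.2f}")
    print(f"Parquet scan cost: ${scan_cost_usd(parquet_bytes_scanned(json_bytes)):.2f}")
```

Under these assumptions the same query drops from $5.00 to roughly $0.10, because the reduction in file size and the reduction in columns read multiply.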

Panther’s compacted data is compatible with most ‘data lake’ tools like Athena, Spark, EMR, and SageMaker.

Most importantly, there’s nothing to tune or configure. Simply upgrade to Panther Enterprise and your queries will fly while your storage costs fall.