My indexing journey

The unprepared

From my background as an accountant and controller I had by nature a good feeling for numbers. This helped me to understand, how to structure data.

A couple of years ago when I started to work in a startup, I usually grabbed data directly from source systems to support our cause to tackle financial issues and finding process deficits.

Over time the queries grew in sizes and run-time got longer. When the run-time was not any more satisfying and our excel sheets with tons of formulas and recalculations broke, the need arose to build something more reliable.

First, I created queries and jobs which persisted data in advance. Just a bunch of reports. But we wanted to give these data also to users. So, we started to build a data warehouse with dimensions and facts and giving access to them with a nice tool. My data journey from Kimball Star Schema to Data Vault is for another blog post. I want to focus here on indexes.

My first objects had no indexes. When performance got degraded, I started to investigate. In SQL Server, there is a feature called „Database Engine Tuning Adviser“. I would consider its suggestions and create indexes and statistics.

Time passed and processing time was growing again. At one time, I analyzed my created indexes and thought, what would happen, if I would drop all indexes? Wow, the processing time decreased from 3.5 hours to 2 hours.

Lessons learned: Don’t create too many indexes. Index creation uses also processing time.

Proper Indexing

My next adventure was about clean data. From my background as an accountant, I want to be sure that my numbers are always correct. My nightmare situation is, when data is not correct and numbers, especially financials don’t match.

Sometimes my Dimensions, Facts or Stage tables would create cartesian products (duplicates). The solution was to make sure, that I would never ever get duplicates. I started to use UNIQUE indexes.

The confidence in my data grew, because any error would alert me.

The downside of it was, that when something broke, parts of my whole pipeline stopped at that stage. If the processing time is 6 – 8 hours and users want data at 9, what happens, if my processes breaks in the night?

We set up a shift plan among our team, so that first thing after getting up is, to check emails on the phone if everything runs smoothly. This would save time until my users get their data. Not a satisfying solution, but with indexes breaking the load, what should I do?!?

Lessons learned: Indexes are great for data quality. But can break loading processes.

Exploring Cloud Databases

At a fast-growing startup, available data explodes. We got into a habit to regularly replace our SQL server to accommodate our increasing data need.

We figured out fast, that always increasing the size of our server is not satisfying. At that time Hadoop with MapReduce made its appearance. Just adding commodity hardware. No high-performance expensive single-server anymore.

I’m a SQL guy. I had a trauma replacing all my code somehow with MapReduce.

To leverage that, I started to analyze some cloud offerings such as Redshift and others.

When testing those analytical platforms, I discovered that none of them supported constrains. No primary key, no UNIQUE indexes. Even nowadays Azure SQL Data Warehouse does not support it. Their primal goal and use is to analyze a lot of data fast without any obstacles.

Their answers to my issues was, that the application must make sure, that there will be no duplicates. BI has usually no application.

For my accountant’s heart this was kind of no go. How to make sure that my or my colleagues programming will be always correct?

Lessons learned: Cloud databases are not always the answer.

BI automation

One way to ensure good results is, to make sure that data processing will always be the same. This is also something that Dan Linstedt inspired me, when he showed me the maturity diagram which he translated to a data warehouse. On the highest level you would have:

  • BI automation
  • Parallel Teams
  • Rapid Delivery

This was something I wanted to achieve. I can say now, that I’m able to produce best results for any objects I’m loading data in. Give me more developers and all will produce great data!

Lessons learned: Using a framework to support data processing application is a big win!

No indexes

A year ago when I migrated to SQL Server 2016, I discovered a new index type. Actually, it is rather new table storing type. Instead of a row store, data is now stored in column stores. Think of saving in Excel column A, B, C, etc. instead of Rows 1, 2, 3.

Over time, I added those CLUSTERED COLUMNSTORE INDEXES to all my objects (except stage tables). The immediate effect was, that I saved a lot of disc space. Usually 10 – 20% from the former disc space is used now.

On top of that I still had my NONCLUSTERED indexes. With time and analyzing execution plans I discovered, that SQL Server hardly uses any of my NONCLUSTERED indexes. And when it uses them, the execution plan might include nasty „Nested Loops“ instead of „Hash Matches“ to join 2 data sets together.

With growing confidence in my BI automation templates, I’m now in the situation that my application makes sure, that anybody in my team produces great results and that I can remove any UNIQUE constrains.

The immediate effect was, that our processing speed increased. Why? Data is not anymore loaded into the databases logfile for checking the UNIQUE constrains before inserting the data, but immediately inserting the data without any delay. Like any other cloud database.

I’m now kind of par with the cloud offerings. There is no need to migrate to cloud because of speed issues. Any indexes, especially UNIQUE indexes, are speed blockers. They don’t have them, I don’t have them. Another benefit was disc space. The indexes are using up to 2 times the space of a CLUSTERED COLUMNSTORE table. Saving space and time!

Do I have a safeguard routine? Of course! As the BI automation solution has all the metadata, I regularly query for duplicates.

When I look back, I started with no indexes and ended with no indexes. The difference is, that with BI automation I increased data processing quality and with CLUSTERED COLUMNSTORE a new technology emerged.

From now on I don’t need to grab my phone at first sight in the morning. The reasons for breaks now are usually external dependencies.

Lessons learned: Sleep is a precious good!