A large dataset with a million rows lands in the data room. You load it to Excel and your laptop grinds to a halt. What can you do?
Welcome to a new world where data is large. Excel tries but it can’t quite get you there. Technically the program can take 1,048,576 rows and 16,384 columns, but in practice anything larger than a hundred thousand rows really impair performance.
This article outlines 3 possible options for you to sidestep the problem.
Use a smaller sample
Instead of working with millions of records, ask for something like 50,000 records. As long as these records are randomly selected, they should technically be representative of the entire customer database.
There is a major downside to this solution. While the records should technically be representative, if there are 10 records per customer on average than 50,000 records yields only 5,000 customers. This sample size might not capture the breadth of customer behavior, especially if there are many different types of behavior.
While statistical tests may still be valid, reducing the data size increases the variability in estimates. With so much at stake, this is not a great option.
Use aggregated data
Even if a database is large, you can still work with it if it has been aggregated for you. This could be done by the management team, an in-house engineering team, or an external vendor like DavinAI.
The upside is that aggregate data can provide KPIs based on 100% of the database with no information loss. Depending on the level of aggregation, it should become of workable size in Excel. The degree of aggregation is key. For example, yearly aggregation may be Excel-friendly where monthly cohort aggregation by business units can still be challenging. However, generally aggregation helps significantly in Excel.
This approach works best when key questions are clear from the get-go. Since there is turnaround time to put aggregations together, any new exploration, modified KPIs, or cuts of the data can cause delays in the results.
Use modern analysis tools
Several tools exist for accessing larger datasets. They are split into two main groups:
Code-free tools like Tableau, PowerBI, and Alteryx rely on click and drag functionality. These work through the single digit millions but grind after that. While there are some technical naming conventions, there can be minimal coding involved. I say “can” because despite the name to use more advanced functionality some SQL and other coding knowledge is needed. These coding challenges are generally equipped with training wheels for the average business analyst and don’t present a major problem.
Code-based tools like Python, R, and SQL rely on languages. Hooked up to databases like Redshift, these tools can process several millions of rows of data with minimal processing time. Once we get into the billions of rows, a tool like Snowflake or Spark is probably needed, especially if tables are being joined and aggregated.
There are a number of advantages with code-based tools. Some datasets can take several days just to ingest if the data comes in forms like nested jsons or multiple Excel files. In those cases, it might be impossible to use a simpler code-free tool like Tableau or Alteryx. Code-based modern data processing platforms can be utilized instead in-house or by an external vendor like DavinAI. At DavinAI, we have purpose-built libraries that process large datasets, apply statistical models, and quantify value using machine learning—all in the diligence timeframe.
Which approach should you choose?
The right approach depends on the complexity of the dataset and the depth of the questions to be answered. Reach out if you have questions; we’re happy to discuss the best solution for your data and questions.