Unlike native applications, browser-based applications operate within the constraints imposed by the browser. The browser imposes these limitations, sometimes letting end users adjust the limits, to safeguard against rogue websites.
How much data?
Modern browsers use a separate process for each tab. Even though the machine may have a lot of RAM, each browser tab can only consume a certain amount of it. For example, Chrome has a 2 GB limit on 32-bit machines and a 4 GB limit on 64-bit machines. Track Issue 416284 to see if and when this limit will be lifted or made adjustable by the end user.
SQL Frames is architected to make efficient use of memory. As data is transformed using various DataFrames, the internal data structures are shared as much as possible.
Starting with version v0.12.7, it is possible to load a CSV file with a raw size of up to 1 GB. Depending on the average size of a row, this typically translates to 5 to 10 million rows.
Sub-second interactions are mostly possible for up to 1 million rows. However, the responsiveness on even larger datasets is still better than performing similar operations in Excel, Numbers, or Google Sheets.
How many rows?
There is no single answer to this; it depends on many factors. In general, SQL Frames is designed for tens of millions of rows. However, due to the memory constraint, the average size of each row determines the total number of rows that can be loaded into memory. A skinny table can accommodate more rows than a wide table. Also, high-cardinality dimensions and ad-hoc fields like notes or comments can take up a lot of space. So, it is best to avoid loading fields that are not used in the final analysis.
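As a back-of-envelope sketch of how row width interacts with the memory limit, consider the following. The budget and per-row byte sizes here are illustrative assumptions, not SQL Frames internals:

```javascript
// Rough estimate of how many rows fit in a given memory budget.
// Real in-memory size depends on data types and structure sharing.
function maxRows(memoryBudgetBytes, avgRowBytes) {
  return Math.floor(memoryBudgetBytes / avgRowBytes);
}

const budget = 2 * 1024 ** 3; // ~2 GB, e.g. the 32-bit Chrome tab limit
console.log(maxRows(budget, 200));  // skinny rows (~200 B): ~10.7M rows
console.log(maxRows(budget, 2000)); // wide rows (~2 KB): ~1.07M rows
```

This is why dropping unused wide fields (notes, comments) before loading can raise the practical row limit by an order of magnitude.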
It is possible to bring only a subset of the data for analysis and fetch the full record details from the server as needed. This lets the user do high-level summary analysis and explore the data, and only when a specific record needs to be fully examined, a simple REST API can provide the necessary data.
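A minimal sketch of this lazy drill-down pattern follows. The endpoint path and response shape are hypothetical, not a SQL Frames or server API:

```javascript
// The browser holds only summary columns; full record details are fetched
// on demand. The `/records/:id` endpoint here is an assumed example.
function recordUrl(baseUrl, id) {
  return `${baseUrl}/records/${encodeURIComponent(id)}`;
}

async function fetchRecordDetail(baseUrl, id) {
  const res = await fetch(recordUrl(baseUrl, id)); // one round trip per drill-down
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json();
}

console.log(recordUrl('/api', 42)); // "/api/records/42"
```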
DOM & Rendering
HTML markup is internally rendered using a data structure called the DOM. Due to all the flexibility of HTML, the DOM is a heavyweight model, and making changes to large amounts of DOM can be very expensive. SQL Frames uses Preact, a UI framework similar to React, and makes use of a Virtual DOM (VDOM). This makes the rendering process as efficient as possible. Still, rendering large amounts of data can be challenging. SQL Frames supports both row and column virtualization; that is, it only renders what is visible rather than everything. So, while a DataFrame with 1000 columns and 1 million rows would be very expensive to render all at once, SQL Frames can easily handle such large and wide tables because only the rows and columns visible in the viewport are rendered.
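The core of row virtualization can be sketched in a few lines: from the scroll offset and viewport height, compute which row indices actually need DOM nodes. A fixed row height is assumed here for simplicity; a production grid such as SQL Frames handles many more cases:

```javascript
// Compute the window of row indices that must be rendered for the
// current scroll position. Everything outside this window stays virtual.
function visibleRows(scrollTop, viewportHeight, rowHeight, totalRows) {
  const first = Math.floor(scrollTop / rowHeight);
  const last = Math.min(
    totalRows - 1,
    Math.ceil((scrollTop + viewportHeight) / rowHeight) - 1
  );
  return { first, last };
}

// 1,000,000 rows in the DataFrame, but only ~26 need DOM nodes at once.
console.log(visibleRows(40000, 600, 24, 1_000_000)); // { first: 1666, last: 1691 }
```

Column virtualization applies the same idea horizontally, using column widths and the horizontal scroll offset.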
How many columns?
SQL Frames supports multi-dimensional pivot tables where both the row and column axes can have multiple fields. This can potentially result in very wide pivot tables. Since SQL Frames provides column virtualization, only the data belonging to the currently visible columns is rendered. As a result, SQL Frames can easily work with thousands of columns, which helps users work with wide pivot tables.
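To see why multi-field column axes get wide quickly, note that the number of leaf columns grows with the product of the column-axis field cardinalities (times the number of measures). A rough illustration, with made-up cardinalities:

```javascript
// Leaf-column count of a pivot: product of column-axis cardinalities
// times the number of measures shown under each column combination.
function pivotColumnCount(columnAxisCardinalities, measureCount) {
  return columnAxisCardinalities.reduce((a, b) => a * b, 1) * measureCount;
}

// year (5 values) x region (4 values) with 2 measures:
console.log(pivotColumnCount([5, 4], 2));      // 40
// add a high-cardinality field (100 values) and the width explodes:
console.log(pivotColumnCount([5, 4, 100], 2)); // 4000
```

Column virtualization is what keeps such widths cheap to render: only the visible slice of those thousands of columns ever reaches the DOM.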
While SQL Frames can be used only for data transformations, with the final result rendered by a 3rd-party UI component, it is highly recommended to use the data grid UI provided by SQL Frames because of the complex optimizations that have been researched and implemented to provide a smooth UX even when working with large amounts of data.
Browser-based analytics is best suited for cases where the user can download a large amount of data once and then do a lot of subsequent data analysis and exploration.
The initial loading of data depends on:
- The server time to serve the data. A query against a database can be more expensive than reading from a file extract.
- The network latency for streaming all the data. Techniques such as compression can be adopted for better performance.
- The time to parse and process the data in the browser.
Each of these steps depends on the number of fields and on the size and data type of individual fields.
CSV vs JSON
Browsers have a native JSON parser but no native CSV parser. As a result, JSON parsing is generally faster. However, JSON can be verbose, resulting in larger streaming sizes; this can be mitigated by using compressed JSON formats. JSON also requires the entire response to fit in browser memory at once, since the whole JSON document must be held while parsing. CSV parsing, on the other hand, makes use of a worker with streaming, which keeps the browser responsive while the CSV is being parsed.
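The verbosity difference is easy to see with the same small table serialized both ways: row-oriented JSON repeats every key on every row, while CSV states the header once.

```javascript
const data = [
  { id: 1, region: 'north', amount: 10 },
  { id: 2, region: 'south', amount: 20 },
  { id: 3, region: 'east', amount: 30 },
];

// Row-oriented JSON: keys repeated per row.
const json = JSON.stringify(data);
// CSV: header once, then values only.
const csv = [
  'id,region,amount',
  ...data.map(r => `${r.id},${r.region},${r.amount}`),
].join('\n');

console.log(json.length, csv.length); // JSON is noticeably larger
```

Column-oriented JSON (one array per field) narrows the gap, which is one form the "compressed JSON formats" mentioned above can take.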
Bringing only the data that is needed for analysis can reduce server processing cost, network streaming cost and the cost of processing within the browser.
SQL Frames is designed for rapid data exploration with sub-second response times, as there is no network round trip after the initial data is loaded. Many data transformation algorithms are designed for sub-second response times when the dataset has no more than a million records. When working with larger datasets, more latency is expected; however, it may still be faster than going back to the server for results (network cost plus server cost, which can be high during peak load).
Certain operations, such as recursive hierarchy flattening, are much slower. In addition, different SQL queries have different execution plans and hence different processing times.
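To illustrate what recursive hierarchy flattening involves, here is a minimal sketch (not the SQL Frames implementation): walk a parent/child tree and emit one row per node with its depth and materialized path. The sequential, data-dependent traversal is part of why such operations cost more than set-oriented transforms.

```javascript
// Flatten a tree into rows of { name, depth, path }.
function flatten(node, depth = 0, path = []) {
  const here = [...path, node.name];
  const rows = [{ name: node.name, depth, path: here.join('/') }];
  for (const child of node.children ?? []) {
    rows.push(...flatten(child, depth + 1, here)); // recurse per child
  }
  return rows;
}

const tree = { name: 'All', children: [
  { name: 'EMEA', children: [{ name: 'UK' }] },
  { name: 'APAC' },
] };
console.log(flatten(tree).map(r => r.path));
// [ 'All', 'All/EMEA', 'All/EMEA/UK', 'All/APAC' ]
```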
While SQL Frames works on Chrome, Safari, Firefox, and Microsoft Edge, the best performance is often observed with Chrome (and Microsoft Edge, which is based on Chromium). Hence Chrome is recommended when working with large datasets.
SQL Frames has been internally tested against TPC-H (at 0.1 scale). Based on the benchmark numbers provided by DuckDB, SQL Frames performs better on most queries than the results published for sql.js, Arquero, and Lovefield. While the DuckDB WASM implementation beats most implementations by a wide margin, SQL Frames focuses on providing an integrated UX of both data transformation and visualization. We believe this is more important for in-browser data analysis and exploration than the additional performance boost from WASM and a columnar format.
Always benchmark your specific use cases. If you believe the system is much slower (an order of magnitude) than similar operations performed by a database, Excel, or Python pandas, let us know and we can investigate.