Opening very large CSV files in the browser
Last week I talked about opening a very large CSV file within an ElectronJS-based app. That post covered how to work around the ~500MB limit on a single string in Chromium's JavaScript engine. I decided to push the boundary so that even larger CSV files can be opened. It turns out it is possible, and the same techniques work not only in ElectronJS (or NodeJS) but also within the browser.
Eventually, SQL Frames managed to load a CSV file of more than 1GB with 10 million records, first in ElectronJS and then within the standard Chrome browser itself. Here is how it was done.
DEMO
If you have a large CSV file handy, see if it can be processed within the browser using the following UI.
Your file is completely processed within the browser and is not uploaded to any server.
Immutability and substrings
In JavaScript, strings are immutable. See the following code to understand this.
let x = 'abcdef';
x[1] = 'z';
console.debug(x);
> 'abcdef' // the result is still 'abcdef' because x can't be mutated
See Mutable for more details about mutability in JavaScript.
When someone needs a substring of a given string, it is possible to either create a brand new string containing the substring value, or use the original string as the backing store for the substring by just keeping track of the start and end positions. This latter optimization works well in some cases and the V8 engine makes use of it. The problem with this optimization is that if the original string is several MB but the final substring is only a few bytes, the original string is still retained in memory, costing a lot of memory.
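Here is a minimal sketch of the retention problem. The chunk contents and sizes are made up for illustration, and the exact behavior depends on V8's internal string representation (short substrings may be copied outright), so treat this as an approximation rather than a guaranteed reproduction.

// buildChunk() stands in for reading a few MB of raw CSV text
function buildChunk() {
  return 'field1,field2,field3\n'.repeat(100000); // roughly 2 MB of text
}

const retained = [];
for (let i = 0; i < 100; i++) {
  const chunk = buildChunk();
  // only 20 characters are needed, but V8 may keep the whole chunk alive
  // as the backing store of this sliced string
  retained.push(chunk.substring(0, 20));
}
// retained holds only a couple of KB of visible data, yet heap usage
// can approach the combined size of all the chunks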
When parsing a CSV file, the file is first read and then processed in chunks. When processing a 1 GB file, each chunk can potentially be a few MB, and all of these raw strings are retained in memory if the corresponding fields just use the substring function to get their values. This is what Papa Parse and most programs do.
A lot of real-life data has many repeating values. String and date dimensions, and even many numbers, repeat quite a bit. Imagine each of these unique values is tracked and all the repetitions are substituted with references to these unique values. This can drastically reduce the memory requirements. However, if the unique values are obtained via substring, they still pin their large parent strings in memory, defeating the purpose. While there is no fix from V8 yet, one workaround is the following:
let x = 'SQL Frames rocks!';
const sqlframes = (' '+x.substring(0,10)).substring(1);
What we are doing is taking the substring, prepending a space character to it (which forces the creation of a brand new string), and then taking a substring of that result to discard the extra leading character. The final value is a compact copy that no longer references the original string.
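As a rough sketch (not SQL Frames' actual code), the same trick can be combined with a cache of unique values so that repeated fields share a single compact copy. The detach and intern helpers below are hypothetical names introduced for illustration.

// force a compact copy so the value no longer references its parent string
function detach(value) {
  return (' ' + value).substring(1);
}

// return a shared, detached copy for a (possibly repeated) field value
const uniqueValues = new Map();
function intern(value) {
  let shared = uniqueValues.get(value);
  if (shared === undefined) {
    shared = detach(value);
    // use the detached copy as the key so the map itself does not retain a slice
    uniqueValues.set(shared, shared);
  }
  return shared;
}

// usage: fields sliced out of a large chunk are replaced with interned copies
const chunk = 'US,2023-01-01,100\nUS,2023-01-02,200\n';
const fields = chunk.split(/[,\n]/).filter(Boolean).map(intern);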
Web Workers
These optimizations almost worked, except they failed when using web workers with Papa Parse. In the case of ElectronJS, using a web worker was not possible because of an object cloning issue, so web workers were not used, and it was still able to parse a 1+ GB file with 10 million rows.
However, the technique didn't work in the browser, and I almost assumed it wouldn't be possible to fix. Then I noticed that by disabling web workers the browser behavior was indeed identical to the ElectronJS version, and it could successfully parse the same large CSV file.
Upon investigation it turns out, when using web workers, it is not possible to control the incoming stream of data. Papa Parse has the ability to pause/resume but it doesn't work with web workers. Because of this, the entire file is being read into memory. Ironically, you typically need a web worker when dealing with a large file as it takes more time to process but it is also in this scenario that it can cause memory problems! For the moment, SQL Frames decided to disable web workers by default.
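For reference, a minimal Papa Parse setup along these lines might look like the following. The processRows step is a hypothetical placeholder for whatever backpressure-aware processing the application does, and file is assumed to come from a file input element.

import Papa from 'papaparse';

Papa.parse(file, {
  worker: false,    // keep parsing on the main thread so pause/resume works
  chunk: (results, parser) => {
    parser.pause(); // stop reading more of the file while this chunk is handled
    processRows(results.data)        // hypothetical async processing step
      .then(() => parser.resume());  // pull in the next chunk only when ready
  },
  complete: () => console.log('parsing finished'),
  error: (err) => console.error(err),
});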