Opening very large CSV files
CSV is one of the most common formats used to share data. Yet opening very large CSV files can be challenging for many applications. Since CSV files are typically opened with Excel on Windows and Numbers on Mac, one would expect these applications to handle them quickly. Surprisingly, they are extremely slow, which is why there are many utility apps offering various levels of functionality for managing large CSV files.
CSV is also one of the common file formats for SQL Frames. However, since SQL Frames is written in TypeScript (which compiles to JavaScript) and JavaScript engines impose a maximum string length, dealing with large CSV files can be challenging.
The CSV parser used by SQL Frames is Papa Parse. This library has several strategies for reading CSV data. It can read a CSV file from a URL, in which case there is no limit on the file size because it is smart enough to work in chunks (including retrieving chunks from the server with range requests). It can also read a CSV file from an HTML File object, and the underlying Blob interface allows the data to be read as an ArrayBuffer. As a result, the maximum string size is not an issue, although most browsers still impose a limit on the Blob size (which is usually larger than the string limit).
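For illustration, a minimal sketch of these two strategies is shown below. The URL, options and callbacks are just examples for this post, not what SQL Frames actually configures.

import Papa from 'papaparse';

// Strategy 1: parse a remote CSV by URL; Papa Parse downloads and parses it in chunks.
Papa.parse('https://example.com/large.csv', {
  download: true,
  header: true,
  chunk: (results) => {
    // Each chunk delivers a batch of parsed rows without holding the whole file in memory.
    console.log('rows in chunk:', results.data.length);
  },
  complete: () => console.log('done'),
  error: (err) => console.error(err),
});

// Strategy 2: parse a File object picked via an HTML file input; the Blob is read in chunks.
function parseLocalFile(file: File) {
  Papa.parse(file, {
    header: true,
    chunk: (results) => console.log('rows in chunk:', results.data.length),
    complete: () => console.log('done'),
  });
}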
One other use case posed a challenge: using SQL Frames within ElectronJS. In ElectronJS, a browser window is launched that loads SQL Frames and behaves just like a web interface. However, to provide a native experience, one of the options chosen was to launch the native file dialog instead of the standard HTML file dialog. As a result, the file was being read using the NodeJS API and the entire contents were returned as a string for processing. The first problem appeared with files larger than about half a GB: the entire file could not be read into a single string. It is possible to read the file as an ArrayBuffer instead of a string, but that still means loading the entire file into memory, which poses a different set of challenges. There are newer APIs that allow streaming, but Papa Parse does not currently make use of the Web Streams API. It does, however, support the NodeJS ReadableStream, and the challenge is to make use of that capability from within the browser context.
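For context, the problematic whole-file approach looks roughly like the sketch below. The IPC channel name and handler are hypothetical; the point is simply that the entire file ends up in a single string.

// main process (sketch): native dialog + whole-file read
import { ipcMain, dialog } from 'electron';
import { promises as fs } from 'fs';

ipcMain.handle('open-csv', async () => {
  const { canceled, filePaths } = await dialog.showOpenDialog({
    properties: ['openFile'],
    filters: [{ name: 'CSV', extensions: ['csv'] }],
  });
  if (canceled || filePaths.length === 0) return null;
  // Reads the entire file into one string; beyond roughly half a GB this breaks down
  // because of the JavaScript engine's maximum string length.
  return fs.readFile(filePaths[0], 'utf8');
});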
So, to work around this problem of dealing with large files, an object representing the file is passed that pretends to be a NodeJS stream, implementing only the fields and methods that Papa Parse requires. The resulting code turned out to be very small, as shown below.
// Papa Parse's stream event listeners ('data', 'end', 'error').
const methods = new Map<string, (...args: any[]) => void>();

// A minimal object that pretends to be a NodeJS readable stream, with only the
// fields and methods Papa Parse actually uses. `file` is the path selected via
// the native dialog.
const stream = {
  name: file,
  readable: true,
  read: () => {},
  pause: () => {},
  resume: () => {},
  on: (evt: string, method: (...args: any[]) => void) => {
    if (['data', 'end', 'error'].indexOf(evt) < 0) return;
    methods.set(evt, method);
    // Once all three listeners are registered, start streaming the file.
    if (methods.size !== 3) return;
    electron.streamFile(file, methods.get('data'), methods.get('end'), methods.get('error'));
  },
  removeListener: (evt: string) => { methods.delete(evt); }
};
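With that object in place, it can be handed to Papa Parse as if it were a regular NodeJS stream. A minimal usage sketch follows; loadRows is a hypothetical callback standing in for whatever is done with the parsed rows.

Papa.parse(stream as any, {
  header: true,
  chunk: (results) => {
    // Rows arrive in batches as electron.streamFile pushes data through the 'data' listener.
    loadRows(results.data); // hypothetical: append rows to the DataFrame being built
  },
  complete: () => console.log('finished streaming', file),
  error: (err) => console.error(err),
});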
This fake stream feels a bit hackish in that it does too much work in the 'on' function. NodeJS streams switch to flowing mode when a 'data' listener is attached, which essentially means they start streaming data automatically. Since Papa Parse provides no explicit control to trigger this, the workaround waits until all three listeners have been set up and then starts streaming the file. The electron.streamFile function has been set up via the context bridge and is safe.
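A sketch of how electron.streamFile could be exposed from the preload script is shown below. The object and function names match the snippet above, but the direct use of fs assumes a preload script with Node access (sandbox disabled); a real implementation may instead relay chunks over IPC from the main process.

// preload.ts (sketch)
import { contextBridge } from 'electron';
import * as fs from 'fs';

contextBridge.exposeInMainWorld('electron', {
  // Streams the file in chunks and forwards them to the renderer-side callbacks.
  streamFile: (
    path: string,
    onData: (chunk: string) => void,
    onEnd: () => void,
    onError: (err: Error) => void
  ) => {
    const rs = fs.createReadStream(path, { encoding: 'utf8' });
    rs.on('data', (chunk) => onData(chunk as string));
    rs.on('end', onEnd);
    rs.on('error', onError);
  },
});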
With the above code, it is now possible to read very large CSV files without running into memory challenges (within the limits of ElectronJS browser windows).
The ability to read very large JSON files is still pending, and it may become yet another post if the solution turns out to be as interesting as this one.
Sometime in the near future, an ElectronJS app that leverages SQL Frames will be released. If you are interested in learning more, please feel free to contact us.