It’s not every day you see a data format make a splash, so I decided to take a brief look at Apache Arrow.
Simply put, we run out of memory a lot when working with big data. Why? The data is too big to load in one chunk, its in-memory representation is inefficient, and we often have to load an entire dataset into memory just to look at a small part of it. Problems like these make analyzing or consuming big data on computers very difficult.
Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.
Well, that’s a mouthful. Let’s try to digest what those mean one by one to understand why it’s amazing.
Without Apache Arrow, each language or library has its own in-memory representation. The same data would be represented differently in Spark’s memory than if you loaded it with Pandas or NumPy.
With Apache Arrow, you have the same in-memory representation between different libraries and even different languages. We will see why this is a game changer in a little bit.
There is not a lot to digest here if you are already familiar with columnar data formats such as Parquet or ORC: Arrow is essentially the in-memory version of the same idea. Columnar formats were one of the big innovations of the big data revolution, so I highly recommend reading about them if you are not familiar. As far as Apache Arrow is concerned, using a columnar memory format is a no-brainer. Here is why.
Arrow is trying to make running analytics on data faster and more enjoyable. Analytical queries usually read a small subset of the columns for a large number of rows at a time. This means if we store the data in a columnar format, we would be able to read huge chunks of it at a time and make optimizations on them (more on this later!).
Arrow supports more complex data types than alternatives, allowing optimized analytical performance on nested structured data as well.
A columnar data layout in memory means operations can be optimized with Single Instruction, Multiple Data (SIMD) instructions: the same analytic function runs on multiple data points simultaneously, allowing very fast algorithms to be implemented on top.
SIMD parallelization is possible with Arrow on modern Intel chips. And just as SIMD instructions parallelize processing on the CPU, analytic operations can also be parallelized on GPUs: Arrow has CUDA integration that lets data scientists run their queries much faster by benefiting from GPU parallelism.
Currently, when we need to read specific information from stored data, it usually means copying the whole dataset into memory first and then searching for the part we are interested in.
Whether an Arrow dataset lives in RAM, in shared memory, or on disk via memory mapping, the cost of accessing a single value in a single column does not depend on the size of the data. In other words, random access to any part of the dataset is effectively constant-time.
This is particularly important when multiple processes are involved: two processes can access the same dataset on disk without any additional overhead, and a dataset can be streamed and analyzed without any additional processing overhead.
When we share data between systems, or even between processes on the same machine, most of the compute time is spent serializing and deserializing the data. Arrow allows data to be shared directly, without any copying, via its built-in IPC framework.
Arrow seems poised to become the de facto in-memory data format for analytical tools. It also seems to be a great fit for machine learning frameworks’ dataset APIs.