Various modern applications of computer science and machine learning use multi-dimensional data sets that comprise a single extended coordinate system. Two examples are using air measurements over a geographic grid to estimate the weather, or predicting medical imaging using multi-channel image intensity values from a 2D or 3D scan. Such datasets can be difficult to work with because users can receive and write data at unpredictable intervals and at different scales, and they often want to run studies on multiple workstations simultaneously. Even a single record can require petabytes of storage space under these circumstances.
Basic technical problems in scientific computing related to the management and processing of huge data sets in neuroscience have already been solved with Google’s TensorStore. TensorStore is an open source C++ and Python software library developed by Google Research to address the problem of storing and manipulating n-dimensional data. This library supports multiple storage systems like Google Cloud Storage, local and network file systems, etc. It provides a unified API for reading and writing different array types. With strong guarantees of atomicity, isolation, consistency and durability (ACID), the library also provides read/writeback caching and transactions. Optimistic concurrency ensures safe access from different processes and computers.
A simple Python API is available through TensorStore to load and work with huge arrays of data. Arbitrarily sized underlying datasets can be loaded and manipulated without storing the entire dataset in memory, since no actual data is read or held in memory until the exact slice is requested. This is possible with indexing and manipulation syntax essentially identical to that used for NumPy operations. Additional advanced indexing features supported by TensorStore include transformations, alignment, broadcasting, and virtual views (data type conversion, downsampling, on-the-fly arrays).
Large numerical data sets require a lot of computing power for processing and analysis. Typically, this is achieved by parallelizing operations between large numbers of CPU or accelerator cores spread across multiple devices. Therefore, a key goal of TensorStore was to enable parallel processing of individual data sets while maintaining high performance (i.e. reading and writing to TensorStore does not become a bottleneck during computation) and security (by preventing corruption or inconsistencies due to concurrent access patterns). to maintain. . TensorStore also has an asynchronous API that allows a read or write operation to continue in the background. At the same time, a program handles other tasks and customizable in-memory caching (which alleviates slower memory system interactions for frequently accessed data). Optimistic concurrency ensures the safety of parallel operations when many machines are accessing the same dataset. It maintains compatibility with various underlying storage layers without severely impacting performance. TensorStore has also been integrated with parallel computing frameworks such as Apache Beam and Dask to enable distributed computing with TensorStore compatible with many current data processing workflows.
Exciting use cases of TensorStore include PaLM and other sophisticated large language models. These neural networks test the limits of computing infrastructure with its hundreds of billions of parameters while demonstrating unexpected capabilities in natural language creation and processing. The efficiency of reading and writing the model parameters presents a difficulty during this training process. Although training is distributed across numerous machines, it is necessary to routinely save parameters to a single checkpoint on a long-term storage system without slowing down the training process. These issues have already been fixed with TensorStore. It has been coupled with frameworks such as T5X and Pathways and used to control checkpoints connected to large (“multipod”) models trained with JAX.
Brain mapping is another intriguing use case. Synapse resolution connectomics aims to trace the intricate network of individual synapses in animal and human brains. This requires petabyte-sized data sets generated by imaging the brain at extremely high resolution and fields of view up to millimeters or more. However, given that they require millions of gigabytes to store, manipulate, and process data within a coordinate system, current datasets present significant storage, manipulation, and processing problems. Because Google Cloud Storage serves as the underlying object storage system, it has been TensorStore is used to solve the computational problems posed by some of the largest and most popular connectomic datasets.
For starters, Google Research has provided the TensorStore package that can be installed with simple commands. They have also published several tutorials and API documentation for further reference.
Github: https://github.com/google/tensorstore
Reference article: https://ai.googleblog.com/2022/09/tensorstore-for-high-performance.html
See the tutorials and API documentation for usage details.
Please Don't Forget To Join Our ML Subreddit
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from Indian Institute of Technology (IIT), Goa. She is passionate about machine learning, natural language processing and web development. She enjoys learning more about the technical field by participating in several challenges.