My requirement is to load a gz file from GCS into BigQuery. I am using Python and Airflow, but the load fails with an error saying that a single compressed file larger than 5 GB cannot be loaded. I tried a BashOperator with the split command, but the split files do not contain the data that is present in the actual file; they contain junk. I also need the split files to be in gz format.
Input file – test.gz
gsutil cp gs://test/test.gz - | split -b 1G - /tmp/split_file_
gsutil cp /tmp/split_file_* gs://testing/
What am I missing? Or is there a faster/more efficient way to do this?
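[Editor's note, in case it helps: splitting the compressed byte stream produces chunks that are not valid gzip files, since only the first chunk carries a gzip header, which would explain the "junk data". One possible fix is to decompress the stream, split on line boundaries, and re-compress each chunk as it is written. A sketch assuming GNU split (for its --filter option) and gzip on the worker; the bucket paths are the placeholders from the question:]

```shell
# Stream the object, decompress it, split into ~1 GB chunks of
# uncompressed data (whole lines only, via -C), and re-gzip each
# chunk as it is written. GNU split sets $FILE to the chunk name
# (prefix + suffix) for the --filter command.
gsutil cat gs://test/test.gz \
  | zcat \
  | split -C 1G --filter='gzip > $FILE.gz' - /tmp/split_file_

# Upload all chunks in parallel.
gsutil -m cp /tmp/split_file_*.gz gs://testing/
```

Using `-C 1G` rather than `-b 1G` keeps each line intact, so every chunk is independently valid CSV/JSON, and 1 GB of uncompressed data compresses to well under the size limit.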