python - Go through tar archive in memory to extract metadata?


I have several tar archives that I need to extract/read in memory. The problem is that each tar contains many zip archives, and each zip contains unique XML documents.

So the structure of each tar is as follows: tar -> directories -> zips -> XML.

Obviously I can manually extract a single tar, but I have 1000 tar archives of 3 GB each, and each contains 6000 zip archives. I'm looking for a way to handle the .tar archives in memory and extract the XML data of each zip. Is there a way to do this?

This should be doable, since all of the relevant methods have non-disk-related options.

There are lots of loops here, so let's dig in.

For each tar archive:

  • Use tarfile.open to open the tar archive. (docs)
  • Call .getmembers on the resulting TarFile instance to get a list of the zips (or other files) contained in the archive. (docs)
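
A minimal sketch of this first loop, assuming the tar lives on disk at a path you pass in and that the zips can be recognized by a ".zip" suffix (the function name list_zip_members is hypothetical, not part of tarfile):

```python
import tarfile


def list_zip_members(tar_path):
    """Return the TarInfo entries in a tar archive whose names end in .zip."""
    # Mode "r" lets tarfile detect plain, gzip, bz2 or xz archives on its own.
    with tarfile.open(tar_path, mode="r") as tar:
        return [
            member
            for member in tar.getmembers()
            # Skip directories and anything that is not a zip file.
            if member.isfile() and member.name.lower().endswith(".zip")
        ]
```

Only the tar headers are read here; the contents of the zips are not extracted.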

For each zip within the tar archive:

  • Once you know which member file (i.e., one of the zips) you want to look through, call .extractfile on the TarFile instance to get a file object for the zip. (docs)
  • Instantiate a new zipfile.ZipFile with the file object in order to open the zip so you can work with it. (docs)
  • Call .infolist on the ZipFile instance to get a list of the files it contains (including your XML files). (docs)
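
The zip-level loop might look roughly like the sketch below. The function name iter_zip_entries is made up for illustration. Copying each zip into an io.BytesIO buffer is a cautious choice on my part, since ZipFile needs a seekable file object; passing the object returned by .extractfile directly may also work.

```python
import io
import tarfile
import zipfile


def iter_zip_entries(tar_path):
    """Yield (zip_name, entry_name) for every file inside every zip in the tar."""
    with tarfile.open(tar_path, mode="r") as tar:
        for member in tar:
            if not (member.isfile() and member.name.lower().endswith(".zip")):
                continue
            # extractfile returns a file-like object; nothing is written to disk.
            raw = tar.extractfile(member)
            # Hold one zip at a time in memory so ZipFile can seek freely.
            buffer = io.BytesIO(raw.read())
            with zipfile.ZipFile(buffer) as zf:
                for info in zf.infolist():
                    yield member.name, info.filename
```

Only one zip is held in memory at a time, which matters when each tar is several gigabytes.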

For each XML file within the zip:

  • Call .open on the ZipFile instance in order to get a file object for one of the XML files. (docs)
  • Now you have a file object corresponding to one of the XML files. Do whatever you want with it: .read it, copy it to disk somewhere, stick it in an ElementTree (docs), etc.
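
Putting the three loops together, here is one possible sketch that parses every XML file with ElementTree. The function name iter_xml_roots and the suffix-based filtering (".zip", ".xml") are assumptions for illustration; adjust them to however your archives are actually named.

```python
import io
import tarfile
import zipfile
import xml.etree.ElementTree as ET


def iter_xml_roots(tar_path):
    """Yield (zip_name, xml_name, root_element) for each XML file in each zip."""
    with tarfile.open(tar_path, mode="r") as tar:
        for member in tar:
            if not (member.isfile() and member.name.lower().endswith(".zip")):
                continue
            # Read the current zip into memory; no temporary files are created.
            buffer = io.BytesIO(tar.extractfile(member).read())
            with zipfile.ZipFile(buffer) as zf:
                for info in zf.infolist():
                    if not info.filename.lower().endswith(".xml"):
                        continue
                    # ZipFile.open gives a file object that ElementTree can parse.
                    with zf.open(info) as xml_file:
                        root = ET.parse(xml_file).getroot()
                    yield member.name, info.filename, root
```

You would then loop over iter_xml_roots("some_archive.tar") (a placeholder path) and pull whatever metadata you need from each root element, for example root.tag or root.findall(...).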
