Tuesday 19 April 2022

How to open the TikTok leaked (scraped?) DB




TL;DR


The TikTok JSON dump is split into 8 parts, so you need `cat` to concatenate them into a single xz file; on its own, only part 001 can be extracted.

Then use jq to convert the one-line data file into a formatted JSON file.

[1] cat tiktok.json.xz.00{1..8} > tiktok.json.xz.full; # bash

[2] Extract it in Nautilus. Be patient: the output keeps growing until it reaches 176GB.

[3] <data jq '.' > tiktok.json
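
For anyone who prefers to script steps [1] and [2] instead of using `cat` and Nautilus, here is a minimal Python sketch using only the standard library. The function names `join_parts` and `decompress_xz` are made up for illustration, and the sketch assumes the parts sort correctly by file name (001 to 008).

```python
import glob
import lzma
import shutil


def join_parts(pattern, joined_path, chunk_size=1 << 20):
    """Append every file matching `pattern`, in sorted name order, to `joined_path`."""
    parts = sorted(glob.glob(pattern))
    with open(joined_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copy in chunks so a multi-GB part never sits in memory.
                shutil.copyfileobj(src, out, chunk_size)
    return parts


def decompress_xz(xz_path, data_path, chunk_size=1 << 20):
    """Stream-decompress `xz_path` into `data_path` chunk by chunk."""
    with lzma.open(xz_path, "rb") as src, open(data_path, "wb") as out:
        shutil.copyfileobj(src, out, chunk_size)
```

With the file names from the steps above, that would be `join_parts("tiktok.json.xz.00?", "tiktok.json.xz.full")` followed by `decompress_xz("tiktok.json.xz.full", "data")`. I have not timed this against `cat` and Nautilus, so treat it as an alternative, not a faster one.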


How I worked it out from scratch:

I first tried jq on the data file extracted from tiktok.json.xz.001 alone. It failed with this error at the end:

    parse error: Unfinished string at EOF at line 1, column 301400064

Before moving on to the other huge files, I needed to know why jq failed.

The last video id in the jq output was 62077733563793408, so I used this id to locate the same spot in the original data file.

I used Python to read the file and split on that id, because ordinary utilities such as `less` (better with -n, though) can't handle such a very long line effectively.

It turned out that only about the last 10 lines had failed to parse.

So jq failed because the final record has no closing "}", which makes sense if the file is just one piece of a split.
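
Looking at the end of a file that is effectively one enormous line does not require reading all of it. A minimal sketch, assuming you only want the final bytes; `read_tail` is a made-up helper name and `data_001` is a placeholder for whatever the extracted 001 data file is called:

```python
import os


def read_tail(path, length=2000):
    """Return the last `length` bytes of the file at `path`."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Jump straight to the tail instead of scanning the whole file.
        f.seek(max(size - length, 0))
        return f.read()
```

For example, `print(read_tail("data_001", 300).decode("utf-8", errors="replace"))` prints the last 300 bytes; `errors="replace"` is there because the cut may land in the middle of a multi-byte character.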

Last characters of the 001 data file:

    "playAddr": "https://v19-web-newkey.tiktokcdn.com/79eec1166f8ae077c86dd37a14d70288/5f84ddc0/video/s3/mp/s3-mp-v-0068/reg02/2016/02/08/08/15804438-6c1b-4a7a-8e35-f04687699854.mp4/?a=1988&br=0&bt

Before running jq on the huge file (the 188.6 GB concatenated data file), I wanted to prove that the parts really are continuations of each other; otherwise I would waste my time on another jq parse error.

So the next thing to prove was that the 002 file really continues the 001 file.

I needed to extract the region where 002 begins from the concatenated, extracted data file to compare.

Ordinary commands such as cut are heavy on a file this size, so I used the "low-level" command `dd` instead.

The 001 data file is 26071203840 bytes (use `ls -la` to get the size; don't measure jq_001.out, which is the already-parsed output).

I rounded that offset down to 26070000000 bytes, which is within 10MB of the boundary, and extracted 10000000 bytes (10MB) of the full data file starting from there:

    dd if=full_data bs=1 skip=26070000000 count=10000000 of=skip_data
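
The same window can be read directly in Python with a seek, which also makes the offset arithmetic explicit. This is a sketch equivalent in intent to the `dd` call above, not a replacement for it; `read_window` is a made-up helper name.

```python
def read_window(path, offset, length):
    """Return `length` bytes of `path` starting at byte `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Numbers taken from the text above.
PART_001_SIZE = 26071203840  # size in bytes of the extracted 001 data file
WINDOW_START = 26070000000   # rounded-down offset (dd skip=)
WINDOW_LENGTH = 10000000     # window size in bytes (dd count=)

# Position of the 001/002 boundary inside the window: 1203840 bytes in.
boundary_in_window = PART_001_SIZE - WINDOW_START
assert 0 < boundary_in_window < WINDOW_LENGTH
```

`read_window("full_data", WINDOW_START, WINDOW_LENGTH)` returns the same bytes that `dd` writes to skip_data, and the boundary between the two parts sits about 1.2MB into them.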

Again with Python, after `r = f.read()`, the check `'62077733563793408' in r` is True.

Then I simply split `r` on '62077733563793408' (into only 2 parts) and printed the second part.

There I could see `=0` following directly after `&bt` (the last 3 characters of the 001 data file), which proves that 002 opens with the right bytes for jq to parse straight through the concatenated full data file.
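
The split-and-look step can be sketched on synthetic bytes. Everything in the sample data below is made up for illustration except the id and the `&br=0&bt` / `=0` fragments quoted above; the real records are far longer and their exact layout is not shown here. `text_after` is a hypothetical helper name.

```python
def text_after(window, marker, length=200):
    """Return up to `length` bytes that follow the first `marker` in `window`.

    Returns None when the marker does not occur.
    """
    _before, found, after = window.partition(marker)
    if not found:
        return None
    return after[:length]


# Made-up stand-ins for the end of part 001 and the start of part 002.
end_of_001 = b'"id": "62077733563793408", "playAddr": "https://example.invalid/v.mp4/?a=1988&br=0&bt'
start_of_002 = b'=0"}'

joined = end_of_001 + start_of_002
print(text_after(joined, b"62077733563793408"))
```

If the parts really are consecutive, the printed text runs through `&bt=0` without a break, which is what I saw in the real data.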



This means it is safe to proceed with `cat`, the extraction, and `<data jq '.' > tiktok.json`.

Be patient because it takes time (for the 176GB data file, jq took me 1 hour 8 minutes 40 seconds; you can try the 2GB 001 file first to estimate how long the full 14.1GB of files will take).
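
To turn a trial run into an estimate, scale the time by the size ratio. The sketch below assumes jq's speed is roughly constant across the file, and that the 188.6 GB figure mentioned earlier is the size of the file jq read in my run; both are assumptions, and the function name is made up.

```python
def scale_time(sample_seconds, sample_bytes, target_bytes):
    """Linearly scale a measured run time to a different input size."""
    return sample_seconds * target_bytes / sample_bytes


# Numbers from this post.
FULL_RUN_SECONDS = 1 * 3600 + 8 * 60 + 40  # 1 hour 8 minutes 40 seconds
FULL_DATA_BYTES = 188.6e9                  # assumption: size of the file jq read
PART_001_BYTES = 26071203840               # extracted 001 data file

rate = FULL_DATA_BYTES / FULL_RUN_SECONDS
part_001_seconds = scale_time(FULL_RUN_SECONDS, FULL_DATA_BYTES, PART_001_BYTES)

print(f"throughput: {rate / 1e6:.1f} MB/s")
print(f"expected time for 001 at that rate: {part_001_seconds / 60:.1f} minutes")
```

Going the other way, time jq on the 001 data file yourself and pass that measurement to `scale_time` with the full size as `target_bytes`.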

After it completed (195GB output file), I used the same steps to compare and confirm that the final JSON item of the parsed output is the same as in the data file.