电脑小白: April 2022

TL;DR

The 8 tiktok json files are split, so you need `cat` to concatenate them to single xz file in order to extract files other than 001.

Then use jq to convert one-liner data file to json formatted file.

[1] cat tiktok.json.xz.00{1..8} > tiktok.json.xz.full; # bash

[2] Extract it in nautilus. Be patient, output size keep increasing until 176GB.

[3] <data jq '.' > tiktok.json

How I research from scratch:

I try jq on file tiktok.json.xz.001 extracted data file first, it failed with error at the end:

parse error: Unfinished string at EOF at line 1, column 301400064

I need to know why jq failed first before proceed other huge file.

I view the jq output last video id is 62077733563793408, and use this id to compare original data file.

I use python to read and split that id because normal utility `less`(-n better though) can't handle effectively such very long line.

I knew it failed ~10 lines only.

So the reason jq failed is because of No closing "}" something which is make sense because it seems like split.

Last characters of 001 data file:

"playAddr": "https://v19-web-newkey.tiktokcdn.com/79eec1166f8ae077c86dd37a14d70288/5f84ddc0/video/s3/mp/s3-mp-v-0068/reg02/2016/02/08/08/15804438-6c1b-4a7a-8e35-f04687699854.mp4/?a=1988&br=0&bt

Before I try jq huge file(188.6 GB concatenated data file), I want to prove the files are continuation otherwise I waste my time on jq parse error.

So the next thing to prove is that 002 file really continue 001 file.

I need extract the beginning part of 002 file from concatenated extracted data file to compare.

The normal command such as cut is heavy, so I try to use "low level" command `dd`.

The 001 data file is 26071203840 bytes (ls -la to get size, you don't use jq_001.out which already parsed).

Then full data file simply round a bit within 10MB range to 26070000000 bytes. Then extract the total 10000000 bytes (10MB).

dd if=full_data bs=1 skip=26070000000 count=10000000 of=skip_data

Again with python, `r = f.read()`, id `'62077733563793408' in r` is True.

Then simply split(only 2 indexes) by '62077733563793408' and print r[1].

Then I can see `=0` continue `&bt` (last 3 characters of 001 data file), which proved that 002 has correct opening bytes to be able continuously parsed by jq on concatenated full data file:

It means that safe to proceed `cat`, extract, and `<data jq '.' > tiktok.json`.

Be patient because it take times (data file 176GB, it took me 1 hour 8 minutes 40 seconds on jq, you can try 001 2GB file first to have expectation time of 14.1GB files).

After completed (195GB output file), I also use same step to compare to proved that the final output json item of parsed file same as data file.

电脑小白

Tuesday, 19 April 2022

How to open tiktok leaked(scraped?) db

Followers

Blog Archive