Re: List of all Bitcoin addresses ever used - currently available on temp location
Board: Project Development
by brainless on 13/01/2021, 17:55:06 UTC
You were discussing how to sort and de-duplicate the list, and how to make it available both raw and sorted. My system is an i3-6100 with 16 GB of DDR4 RAM, and on it I sort and de-duplicate the whole 19 GB raw file within an hour; the daily update files are only a few minutes' work.
Let me explain. It is simple:

sort raw.txt > sorted.txt
split -l 50000000 sorted.txt

(split writes the chunks as files named xaa, then xab, and so on.)
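If the big sort is slow or memory-bound, GNU sort's buffer and parallelism flags can help; this is a sketch assuming GNU coreutils, with sizes to adjust for your machine:

sort -S 8G --parallel=4 raw.txt > sorted.txt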
Next, remove duplicates with Perl. It is fast and can load roughly a 3 GB file into memory, and we make it faster still by working on 50-million-line chunks:

perl -ne 'print unless $_{$_}++' xaa > part1.txt

and for the second chunk:

perl -ne 'print unless $_{$_}++' xab > part2.txt
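Since sorted.txt is already sorted, duplicates within each chunk sit on adjacent lines, so plain uniq gives the same per-chunk result as the Perl one-liner while using almost no memory:

uniq xaa > part1.txt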
Done this way, you will have finished all the chunks within the hour.

Now combine all the parts:

cat part*.txt > full-sorted.txt

or, to be sure the output stays sorted (the shell expands part*.txt lexically, so part10.txt would come before part2.txt), list the files explicitly:

cat part1.txt part2.txt part3.txt ... part10.txt > full-sorted.txt
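One caveat: a duplicate that straddles a chunk boundary (last line of xaa equals first line of xab) survives the per-chunk pass, so a final uniq over the combined file is a cheap safety net. A sketch that also keeps the parts in numeric order, assuming GNU coreutils for sort -V:

ls part*.txt | sort -V | xargs cat | uniq > full-sorted.txt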

Stage 2

For the second group, continue onward from 21 Dec 2020: take all the daily update files, then combine, sort, and remove duplicates. Name the result new-group.txt. (A sketch of this step follows below.)
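A minimal sketch of that step, assuming the daily files are named daily-*.txt (the real filenames will differ):

cat daily-*.txt | sort -u > new-group.txt

sort -u sorts and de-duplicates in a single pass.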

The command is:

join new-group.txt full-sorted.txt > filter.txt

Here filter.txt contains the addresses common to the two files (new-group.txt and full-sorted.txt). Now remove the filter.txt entries from new-group.txt to get only the genuinely new addresses:

awk 'FNR==NR{ a[$1]; next } !($1 in a)' filter.txt new-group.txt > pure-new-addresses.txt
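As a side note, since both files are sorted, comm can do the same two steps directly; this is an alternative sketch, not part of the recipe above:

comm -12 new-group.txt full-sorted.txt > filter.txt
comm -23 new-group.txt full-sorted.txt > pure-new-addresses.txt

comm -12 prints lines common to both files, and comm -23 prints lines that appear only in new-group.txt.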

Stage 3

If you still need everything in one file, combine pure-new-addresses.txt and full-sorted.txt and re-sort:

cat pure-new-addresses.txt full-sorted.txt > pre-full-sorted.txt
sort pre-full-sorted.txt > new-full-addresses.txt
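Since both inputs are already sorted, a merge is cheaper than a full re-sort; with GNU sort the same result should come from this (an untested suggestion):

sort -m pure-new-addresses.txt full-sorted.txt > new-full-addresses.txt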


It is recommended to keep one file frozen as last created on 21 Dec 2020. From the second file onward, perform only stage 2; you will then have only the new addresses that do not appear in the first 19 GB file.

I hope I have explained all the points and that this helps you and the community. If you need any further info, just ask; I am happy to provide whatever I have.