Adobe password leak analyzed with pandas and Python
I recently found out that one of my throwaway email accounts was in the Adobe password leak. I wanted to see what was out there, so I downloaded the leaked data. It looks like this:
<some ID>-|-<username, mostly missing>-|-<email>-|-<password hash>-|-<password hint>|--
103238705-|--|[email protected]|-BB4e6X+b2xLioxG6CatHBw==-|-boyfriend|--
103238706-|--|[email protected]|-Cm8mAzxAiwzioxG6CatHBw==-|-dance|--
103238707-|--|[email protected]|-n+TZlu41zyHioxG6CatHBw==-|-|--
103238708-|--|[email protected]|-FAniAwP+U13ioxG6CatHBw==-|-|--
103238709-|--|[email protected]|-kxiV+a47bSlf+E5Ulu/AzA==-|-newest|--
103238710-|--|[email protected]|-UimSy9NunUU=-|-dog|--
Here's the relevant xkcd
I put together a small script to convert the file to a tab-separated file. This is probably not necessary, but I didn't have to think much about it, and it makes things easier later on. I did all of this in an Anaconda 3.4 installation.
import csv

filename = 'cred.csv'

with open(filename, 'rt') as in_file, \
        open('converted.tsv', 'w', newline='') as out_file:
    csvwriter = csv.writer(out_file, delimiter='\t')
    # Iterate over the file directly; calling readline() inside the loop
    # as well would consume a second line on every pass.
    for text in in_file:
        if text == '\n':
            continue  # skip blank lines
        try:
            t = text.split('-|-')
            email = t[2].split('@')
            # id, username, email front, email provider, hash without '=' padding, hint
            row = [t[0], t[1], email[0], email[1], t[3].rstrip('='), t[4][:-4]]
        except IndexError:
            # malformed record (missing field or no '@' in the email)
            continue
        csvwriter.writerow(row)
Next I wrote a small script to check for my email:
filename = 'converted.tsv'
email = 'test'

with open(filename, 'r') as in_file:
    for line in in_file:
        # plain substring match anywhere in the line
        if email in line:
            print(line)
Sample output:
12345990 name gmail.com sKZcyHioxGNzioxG6CfCw dog
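The substring check above also matches lines where the search term shows up in the hint or the hash. A minimal sketch of an exact lookup, assuming the column order written by the conversion script; the helper name `find_email` and the sample rows are made up for illustration:

```python
import csv
import io

def find_email(tsv_file, address):
    """Yield rows whose email front and provider both match `address` exactly."""
    front, _, provider = address.partition('@')
    for row in csv.reader(tsv_file, delimiter='\t'):
        # Assumed columns: id, username, emailfront, emailprovider, hash, hint
        if len(row) >= 6 and row[2] == front and row[3] == provider:
            yield row

# Made-up rows standing in for converted.tsv
sample = io.StringIO(
    "1\t\tname\tgmail.com\tAAAA\tdog\n"
    "2\t\tname2\tgmail.com\tBBBB\tcat\n"
)
matches = list(find_email(sample, 'name@gmail.com'))
print(matches)
```

For the real file, pass an open file handle for `converted.tsv` instead of the `StringIO` object.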
I like the pseudocode style of Python :-). My password hint is dog, and I can see my hash. Now my interest was piqued, and I wanted to see what I could learn from this dataset. I thought this was a good opportunity to learn a bit about pandas: I come from a MATLAB environment for data wrangling and wanted to see what Python has to offer.
Using pandas, this is pretty straightforward. Since my computer doesn't have enough memory, I used only the first 10,000,000 entries. Next I will probably look into ways to work faster with that much data. Here is the IPython notebook I used:
import time
import pandas as pd

filename = 'converted.tsv'

# time.perf_counter() replaces time.clock(), which was removed in Python 3.8
tic = time.perf_counter()
adobeDataFrames = pd.read_csv(filename, nrows=10000000, delimiter='\t',
                              usecols=[2, 3, 4, 5],
                              names=['emailfront', 'emailprovider', 'passwordhash', 'hint'])
toc = time.perf_counter()
print('[*] Time to read data: ', toc - tic)
[*] Time to read data: 19.532169279833354
print('[*] rowcount: ', len(adobeDataFrames.index))
[*] rowcount: 10000000
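One way around the memory limit mentioned above is to read the file in chunks and add up the counts per chunk. This is only a sketch, not what the notebook does: the helper `count_providers`, the column names for the first two fields, and the sample rows are assumptions for illustration.

```python
import io
from collections import Counter

import pandas as pd

# Assumed names for all six columns of converted.tsv
COLUMNS = ['idnumber', 'username', 'emailfront', 'emailprovider', 'passwordhash', 'hint']

def count_providers(tsv_source, chunksize=1000000):
    """Count email providers without loading the whole file at once."""
    counts = Counter()
    reader = pd.read_csv(tsv_source, delimiter='\t', names=COLUMNS,
                         usecols=['emailprovider'], chunksize=chunksize)
    for chunk in reader:
        # value_counts() per chunk, accumulated in a plain Counter
        counts.update(chunk['emailprovider'].value_counts().to_dict())
    return pd.Series(counts).sort_values(ascending=False)

# Made-up rows standing in for converted.tsv
sample = io.StringIO(
    "1\t\ta\tyahoo.com\tH1\tdog\n"
    "2\t\tb\tgmail.com\tH2\tcat\n"
    "3\t\tc\tyahoo.com\tH1\tdog\n"
)
totals = count_providers(sample, chunksize=2)
print(totals)
```

Only one chunk is held in memory at a time, so the chunk size can be tuned to the machine.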
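The most common email providers (the slice [1:10] starts at the second entry, so the single most frequent provider is not shown):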
print(adobeDataFrames.emailprovider.value_counts()[1:10])
yahoo.com        1285637
gmail.com         937736
aol.com           315863
msn.com           138874
comcast.net       107684
hotmail.co.uk      99280
web.de             83145
gmx.de             65824
sbcglobal.net      64469
dtype: int64
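A small sketch on made-up data of how slicing interacts with value_counts(): the result is sorted with the most frequent value first, so a slice starting at 1 leaves that value out, while head(n) keeps it. The provider names here are invented.

```python
import pandas as pd

# Made-up provider column: a.com is the most frequent value
providers = pd.Series(['a.com'] * 3 + ['b.com'] * 2 + ['c.com'])
counts = providers.value_counts()  # sorted by frequency, most common first

top_two = counts.head(2)           # includes the most common entry
skipping_first = counts[1:3]       # positional slice, starts at the second entry

print(top_two)
print(skipping_first)
```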
The most common password hints (I was once a "usual" myself :-)):
print('[*] hints:\n',adobeDataFrames.hint.value_counts()[1:20])
[*] hints:
name        53044
??          36036
usual       34571
????        33202
???         25743
me          25246
same        23438
cat         22477
son         18065
daughter    17497
nickname    16957
?????       15753
??????      14079
pet         13315
work        12744
normal      12544
car         12042
my name     11914
love        11381
dtype: int64
The most common email names (the part before the @). Notice the absence of female names on the list:
print('[*] front of email:\n',adobeDataFrames.emailfront.value_counts()[1:20])
[*] front of email:
webmaster     8236
mail          7246
admin         7216
adobe         6471
sales         4874
john          4677
chris         4522
david         4388
mike          4208
mark          3568
contact       3440
paul          3408
steve         3321
macromedia    3194
peter         2850
michael       2828
support       2818
office        2802
dave          2447
dtype: int64
Now we come to an interesting part. Adobe always used the same algorithm to calculate the hash and did not salt the stored hashes, so identical passwords end up with identical hashes. Here is a list of the most common hashes, which correspond to the most common passwords.
print('[*] passwordhashes:\n',adobeDataFrames.passwordhash.value_counts()[1:20])
[*] passwordhashes:
L8qbAD3jl3jioxG6CatHBw    37431
j9p+HwtWWT86aMjgZFLzYg    23348
j9p+HwtWWT/ioxG6CatHBw    14591
5djv7ZCI2ws               13368
7LqYzKVeq8I               10862
dQi0asWPYvQ                9701
ukxzEcXU6Pw                8474
WqflwJFYW3+PszVFZo1Ggg     7904
BB4e6X+b2xLioxG6CatHBw     6734
diQ+ie23vAA                6726
kCcUSCmonEA                6616
e6MPXQ5G6a8                6311
4V+mGczxDEA                5902
PMDTbP0LZxu03SwrFUvYGA     5873
xz6PIeGzr6g                4743
hjAYsdUA4+k                4493
5wEAInH22i4                4361
rpkvF+oZzQvioxG6CatHBw     4245
j9p+HwtWWT8/HeZN+3oiCQ     4142
dtype: int64
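To illustrate why missing salts make such a frequency table possible, here is a minimal sketch. It does not reproduce Adobe's scheme: SHA-256 and PBKDF2 from Python's hashlib are stand-ins chosen for the example, and the function names are made up.

```python
import hashlib
import os

def unsalted(password):
    """Same password in, same digest out: duplicates are visible in the data."""
    return hashlib.sha256(password.encode('utf-8')).hexdigest()

def salted(password, salt=None):
    """A fresh random salt per user makes equal passwords look different."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac('sha256', password.encode('utf-8'), salt, 100000)
    return salt, digest.hex()

print(unsalted('123456') == unsalted('123456'))    # identical digests
print(salted('123456')[1] == salted('123456')[1])  # different salts, different digests
```

With per-user salts, counting identical stored values would no longer reveal which passwords are popular.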
To come full circle to the xkcd comic, we pull the most common hints for the most common passwords (hashes). Solving these is left to the reader.
# Note: the [1:10] slices start at the second most frequent entry
hash_list = adobeDataFrames.passwordhash.value_counts()[1:10].index.tolist()
for h in hash_list:
    print('[*] hash: ', h)
    print(adobeDataFrames.hint[adobeDataFrames['passwordhash'] == h].value_counts()[1:10])
[*] hash: L8qbAD3jl3jioxG6CatHBw
pw          282
usual       261
same        225
easy        205
hint        201
word        172
duh         141
wordpass    141
obvious     140
dtype: int64
[*] hash: j9p+HwtWWT86aMjgZFLzYg
123          496
numbers      442
1-9          286
987654321    281
123456       275
number       180
1            122
19           109
12345         96
dtype: int64
[*] hash: j9p+HwtWWT/ioxG6CatHBw
1-8         177
123         136
number      135
numeros     113
87654321     83
1234         74
123456       71
18           57
1            50
dtype: int64
[*] hash: 5djv7ZCI2ws
ytrewq    149
q         111
qw         79
asdfgh     67
qwert      65
qwe        63
key        61
qy         47
123456     41
dtype: int64
[*] hash: 7LqYzKVeq8I
222222     120
111         63
123456      62
11          54
one         50
numbers     39
61          37
number      33
6           31
dtype: int64
[*] hash: dQi0asWPYvQ
123         128
numeros     123
1-7         115
7654321      90
123456       81
number       80
1            50
12           47
12345678     47
dtype: int64
[*] hash: ukxzEcXU6Pw
Series([], dtype: int64)
[*] hash: WqflwJFYW3+PszVFZo1Ggg
flash            13
software          8
macro             7
company           6
mac               5
company name      4
programa          4
manufacturer      3
software name     3
dtype: int64
[*] hash: BB4e6X+b2xLioxG6CatHBw
123           45
usual         24
adobe 123     21
site          18
software      18
123adobe      18
name          17
company123    16
Adobe         15
dtype: int64
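The same hints-per-hash lookup can be tried on a tiny made-up frame. This sketch uses head() so that the most frequent hash and hint are included; the helper `top_hints` and the sample values are invented for illustration.

```python
import pandas as pd

# Made-up data with the same column names as in the notebook
df = pd.DataFrame({
    'passwordhash': ['H1', 'H1', 'H1', 'H2', 'H2'],
    'hint':         ['123', '123', 'numbers', 'qwerty', 'qwerty'],
})

def top_hints(frame, n_hashes=2, n_hints=3):
    """Map each of the most common hashes to its most common hints."""
    result = {}
    for h in frame.passwordhash.value_counts().head(n_hashes).index:
        hints_for_hash = frame.loc[frame.passwordhash == h, 'hint']
        result[h] = hints_for_hash.value_counts().head(n_hints)
    return result

hints = top_hints(df)
for h, series in hints.items():
    print('[*] hash: ', h)
    print(series)
```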
I will stop here, although there is much more information in the data. For instance, the hash itself can tell you something about the password length. An interesting article on this can be found on Naked Security. I am very happy with my first pandas experience and will definitely use it again.
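As a rough sketch of the length remark above: the stored values look like base64, so the stripped '=' padding can be restored and the number of decoded bytes compared. How byte length maps to password length is not worked out here; the helper `decoded_length` is made up for illustration.

```python
import base64

def decoded_length(stripped_hash):
    """Restore the '=' padding removed during conversion and count decoded bytes."""
    padded = stripped_hash + '=' * (-len(stripped_hash) % 4)
    return len(base64.b64decode(padded))

# Two values taken from the hash table above
for value in ['5djv7ZCI2ws', 'L8qbAD3jl3jioxG6CatHBw']:
    print(value, decoded_length(value))
```

The shorter value decodes to 8 bytes and the longer one to 16.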