Scraping the 360 DGA Domain Family Dataset

For a course project I needed the 360 DGA domain family dataset, so I threw together a rather crude script to scrape it. When I find the time I still need to properly study regular expressions and web crawling.

import os
import re
import urllib.request

import bs4
import requests

url = 'http://data.netlab.360.com/dga/'

# Fetch the DGA index page and parse it
up = urllib.request.urlopen(url)
cont = up.read().decode('utf-8')
cont = bs4.BeautifulSoup(cont, "lxml")

# Collect the DGA family names from anchors whose href looks like "#family"
addr_seed = []
for link in cont.find_all('a'):
    temp = link.get('href')
    if temp is not None:
        if temp.startswith('#') and not temp.endswith('#'):
            temp = re.sub('#', "", temp)
            addr_seed.append(temp)

# Download the raw feed of each family
os.makedirs("360DGA域名家族数据集", exist_ok=True)
for i in addr_seed:
    download_addr = "http://data.netlab.360.com/feeds/dga/" + i + ".txt"
    f = requests.get(download_addr)
    with open("360DGA域名家族数据集/" + i + ".txt", "wb") as code:
        code.write(f.content)

# The code below is a new feature finished at 22:00 on June 8, 2019:
# extract the DGA domains from the raw data and write them out as a new
# set of .txt files.
os.makedirs("360DGA域名家族数据集清洗后", exist_ok=True)
for i in addr_seed:
    with open("360DGA域名家族数据集/" + i + ".txt", "r") as f, \
         open("360DGA域名家族数据集清洗后/" + i + ".txt", "a") as f1:
        for j in f:
            if not j.startswith('#') and not j.startswith('\n'):
                # Take the first "word.word" pattern on the line as the domain
                match = re.findall(r'\w+\.\w+', j)
                if match:
                    f1.write(match[0] + '\n')

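For the course project, the cleaned files can then be loaded back in. Below is a minimal sketch, assuming only the output directory name used above and the standard library; the load_cleaned_dataset helper is my own illustration, not part of the original script.

import os

def load_cleaned_dataset(data_dir="360DGA域名家族数据集清洗后"):
    # Read every per-family .txt file and return (domain, family) pairs;
    # the file name (minus .txt) is taken as the DGA family label.
    samples = []
    for filename in os.listdir(data_dir):
        if not filename.endswith(".txt"):
            continue
        family = filename[:-len(".txt")]
        with open(os.path.join(data_dir, filename), "r") as f:
            for line in f:
                domain = line.strip()
                if domain:
                    samples.append((domain, family))
    return samples

samples = load_cleaned_dataset()
print(len(samples), samples[:3])
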
The results are as follows: