  • Crawler for Xueqiu user timelines (tweets), endpoint: https://xueqiu.com/v4/statuses/user_timeline.json #### Usage 1. Make the following edits at the key spots before running the script: open a browser, log in to a Xueqiu account, grab the cookie, and replace the one in the ...
  • Xueqiu

    2018-08-16 00:37:00
    import json
    import requests
    import pymysql
    from mysql_test import mysql_conn
    
    # The bare request gets blocked, so try adding request headers
    headers = {
        #'Accept': '*/*',
        #'Accept-Encoding': 'gzip, deflate, br',
        #'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        #'Connection': 'keep-alive',
        'Cookie': 'aliyungf_tc=AQAAALoQF3p02gsAUhVFebQ3uBBNZn+H; xq_a_token=584d0cf8d5a5a9809761f2244d8d272bac729ed4; xq_a_token.sig=x0gT9jm6qnwd-ddLu66T3A8KiVA; xq_r_token=98f278457fc4e1e5eb0846e36a7296e642b8138a; xq_r_token.sig=2Uxv_DgYTcCjz7qx4j570JpNHIs; _ga=GA1.2.516718356.1534295265; _gid=GA1.2.1050085592.1534295265; u=301534295266356; device_id=f5c21e143ce8060c74a2de7cbcddf0b8; Hm_lvt_1db88642e346389874251b5a1eded6e3=1534295265,1534295722; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1534295722',
        #'Host': 'xueqiu.com',
        #'Referer': 'https://xueqiu.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
        #'X-Requested-With': 'XMLHttpRequest',
        #'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    
    # Fetch the category timeline endpoint
    url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id=-1&count=10&category=111'
    
    
    response = requests.get(url, headers=headers)
    
    # Parse the JSON body into a dict
    res_dict = json.loads(response.text)
    
    list_list = res_dict['list']
    #print(list_list)
    # Iterate over the feed items
    for list_item_dict in list_list:
        # Each item is a dict whose 'data' field is itself a JSON string
        data_str = list_item_dict['data']
        # print(data_str)
        data_dict = json.loads(data_str)
        data = {}
        data['data_id'] = data_dict['id']
        data['data_title'] = data_dict['title']
        data['data_description'] = data_dict['description']
        data['data_target'] = data_dict['target']
        # print('-'*50)
        # print(data)
        try:
            # String-formatted SQL breaks on embedded quotes and is injection-prone;
            # see the parameterized variant after this code
            sql = 'insert into xueqiu(data_id,data_title,data_description,data_target) values("{data_id}","{data_title}","{data_description}","{data_target}")'.format(**data)
            mc = mysql_conn()
            mc.execute_modify_mysql(sql)
        except Exception:
            # Skip rows that fail to insert (e.g. a duplicate data_id)
            pass
    # The mysql_conn helper (presumably the mysql_test module imported above):
    import pymysql
    
    class mysql_conn(object):
        # Constructor: open the connection and grab a cursor
        def __init__(self):
            self.db = pymysql.connect(host='127.0.0.1', user='root', password='lxh1122', port=3306, database='py11')
            self.cursor = self.db.cursor()
    
        # Execute a modifying statement (INSERT/UPDATE/DELETE) and commit
        def execute_modify_mysql(self, sql):
            self.cursor.execute(sql)
            self.db.commit()
    
        # Destructor: close the cursor and the connection
        def __del__(self):
            self.cursor.close()
            self.db.close()
    
    if __name__=='__main__':
        sql = 'insert into xueqiu values (3)'
        mc = mysql_conn()
        mc.execute_modify_mysql(sql)
        sql = 'insert into xueqiu values (4)'
    
        mc.execute_modify_mysql(sql)
        sql = 'insert into xueqiu values (5)'
    
        mc.execute_modify_mysql(sql)
        sql = 'insert into xueqiu values (6)'
    
        mc.execute_modify_mysql(sql)
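
    The string-formatted INSERT above breaks as soon as a title or description contains a double quote, and it is open to SQL injection. A safer variant (a minimal sketch, reusing the connection settings above and the data dict built in the loop, and assuming the same xueqiu table) passes the values as query parameters so pymysql escapes them:

    import pymysql

    db = pymysql.connect(host='127.0.0.1', user='root', password='lxh1122', port=3306, database='py11')
    cursor = db.cursor()
    # %s placeholders are filled in by the driver, so quotes inside the data cannot break the statement
    sql = ('insert into xueqiu(data_id, data_title, data_description, data_target) '
           'values (%s, %s, %s, %s)')
    cursor.execute(sql, (data['data_id'], data['data_title'],
                         data['data_description'], data['data_target']))
    db.commit()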

     

    Reposted from: https://www.cnblogs.com/lxh777/p/9484858.html

  • Scraping Xueqiu trade data with Python

    2020-08-13 15:51:24

    Scraping Xueqiu trade data, with Python source.
    Xueqiu is a social network for investors; this scrapes users' portfolio trade records.

    Code:

    def get_trade_behavior(uid):
        import requests
        import random
        import time
        import json
        result = []
        res = []
        headers = [{
                       'User-Agent': "Mozilla/5.0 (X11; CrOS x86_64 10066.0.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
                       'Accept': 'text/html;q=0.9,*/*;q=0.8',
                       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                       'Connection': 'close'},
                   {
                       'User-Agent': "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1 (KHTML, like Gecko) CriOS/69.0.3497.100 Mobile/13B143 Safari/601.1.46",
                       'Accept': 'text/html;q=0.9,*/*;q=0.8',
                       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                       'Connection': 'close'},
                   {
                       'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A",
                       'Accept': 'text/html;q=0.9,*/*;q=0.8',
                       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                       'Connection': 'close'},
                   {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
                    'Accept': 'application/json, text/plain, */*',
                    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                    'Connection': 'close'},
                   {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
                    'Accept': 'application/json, text/plain, */*',
                    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                    'Connection': 'close'}]
    
        s = requests.Session()
        s.keep_alive = False
        # t = 1
        try:
            # while True:
            url = "https://xueqiu.com/service/tc/snowx/PAMID/cubes/rebalancing/history?cube_symbol=SP" + uid + "&count=20&page=1"
            obj = s.get(url, headers=random.choice(headers), stream=True, allow_redirects=False).json()
            time.sleep(random.random() * 3)
            maxpage = obj["maxPage"]
            # if obj["list"] != []:
            for k in range(1, maxpage + 1):
                url = "https://xueqiu.com/service/tc/snowx/PAMID/cubes/rebalancing/history?cube_symbol=SP" + uid + "&count=20&page=" + str(k)
                print("正在检索{%s}-第%d页-总共%d页" % (uid, k, maxpage))
                obj = s.get(url, headers=random.choice(headers), stream=True, allow_redirects=False).json()
                time.sleep(random.random() * 3)
                for i in obj["list"]:
                    res.append(uid)
                    time_stamp = i["updated_at"]
                    # updated_at is a millisecond timestamp; convert to seconds
                    time_stamp_10 = int(round(time_stamp / 1000))
                    time_local = time.localtime(time_stamp_10)
                    trade_time = time.strftime("%Y-%m-%d %H:%M:%S", time_local)
                    # Only the first leg of each rebalancing event is kept
                    trade_history_stock_name = i["rebalancing_histories"][0]["stock_name"]
                    trade_history_stock_symbol = i["rebalancing_histories"][0]["stock_symbol"]
                    trade_history_stock_prev_weight = i["rebalancing_histories"][0]["prev_weight_adjusted"]
                    trade_history_stock_target_weight = i["rebalancing_histories"][0]["target_weight"]
                    trade_history_stock_exec_price = i["rebalancing_histories"][0]["price"]
                    res.append(trade_time)
                    res.append(trade_history_stock_name)
                    res.append(trade_history_stock_symbol)
                    res.append(trade_history_stock_prev_weight)
                    res.append(trade_history_stock_target_weight)
                    res.append(trade_history_stock_exec_price)
                    res_copy = res.copy()
                    result.append(res_copy)
                    res.clear()
            print("{%s} 检索完毕!" % uid)
            return result
        except Exception:
            print("{%s} failed!" % uid)
            return [uid, "异常"]  # "异常" is the error sentinel filtered out below
    
    
    
    
    
    def read_csv(name):
        '''Read all rows from a CSV file on the desktop.'''
        import csv
        csv_file = csv.reader(open("C:\\Users\\viemax\\Desktop\\" + name + ".csv", "r"))
        object_website = []
        for i in csv_file:
            object_website.append(i)
        return object_website
    
    
    
    no_data_id = read_csv("no_data_id")
    
    # Column 1 of each row holds the cube id; the first two rows are skipped
    obj = []
    for i in no_data_id[2:]:
        obj.append(i[1])
    
    
    # Sample every other id to keep the request volume down
    res = []
    for i in obj[0::2]:
        r = get_trade_behavior(i)
        res.append(r)
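
    The loop above only keeps result in memory; a small companion to read_csv (a sketch: write_csv, the filename, and the desktop path are all assumptions in the same style) persists the collected batches:

    import csv

    def write_csv(name, batches):
        '''Write each batch of rows returned by get_trade_behavior to a desktop CSV.'''
        with open("C:\\Users\\viemax\\Desktop\\" + name + ".csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            for batch in batches:
                if batch and isinstance(batch[0], list):  # skip the [uid, "异常"] error pairs
                    writer.writerows(batch)

    write_csv("trade_history", res)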
    
    
    
    
    def xueqiu(num):
        import requests
        from bs4 import BeautifulSoup
        import random
        import time
        url = u"https://xueqiu.com/P/SP" + num
        headers = [{'User-Agent': "Mozilla/5.0 (X11; CrOS x86_64 10066.0.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
                    'Accept': 'text/html;q=0.9,*/*;q=0.8',
                    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                    'Connection': 'close'},
                   {'User-Agent': "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1 (KHTML, like Gecko) CriOS/69.0.3497.100 Mobile/13B143 Safari/601.1.46",
                    'Accept': 'text/html;q=0.9,*/*;q=0.8',
                    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                    'Connection': 'close'},
                   {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A",
                    'Accept': 'text/html;q=0.9,*/*;q=0.8',
                    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                    'Connection': 'close'},
                   {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
                    'Accept': 'application/json, text/plain, */*',
                    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                    'Connection': 'close'},
                   {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
                    'Accept': 'application/json, text/plain, */*',
                    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                    'Connection': 'close'}]
        cookie = [dict(cookies_are="device_id=33a80200aacb73cf594a45942b285a12; _ga=GA1.2.312459015.1529772425; s=ey177hmx06; bid=ae1522508305909e11f0ccaefc21ae37_jn93s7rs; __utmz=1.1539536073.4.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; Hm_lvt_fe218c11eab60b6ab1b6f84fb38bcc4a=1539591917; _gid=GA1.2.758749044.1540657586; aliyungf_tc=AQAAAIe8YFC/zwwAKvJZ2tC9k8DvMt34; __utmc=1; __utma=1.312459015.1529772425.1540825606.1540828390.19; remember=1; remember.sig=K4F3faYzmVuqC0iXIERCQf55g2Y; xq_a_token.sig=p4pCAuWXphKrks3IjEzTbJFCcb4; xqat.sig=uWTQIYsOCqtgymFewPvkgLk8CyM; xq_r_token.sig=Q9P70D5S5ZuHuFEXVJ6umTRqL1o; xq_is_login.sig=J3LxgPVPUzbBg3Kee_PquUfih7Q; u.sig=Ra3Ht4oGmAXu5VtkPBpRXum-Ntc; Hm_lvt_1db88642e346389874251b5a1eded6e3=1540825899,1540828382,1540829378,1540829450; snbim_minify=true; __utmt=1; _gat_gtag_UA_16079156_4=1; xq_a_token=18b7f7dec4f54032863219716eaf839ee940199d; xqat=18b7f7dec4f54032863219716eaf839ee940199d; xq_r_token=f27bcc9f6c7b6446279ee9448db195b118b8f17c; xq_token_expire=Sat%20Nov%2024%202018%2001%3A55%3A26%20GMT%2B0800%20(CST); xq_is_login=1; u=7147604028; __utmb=1.52.10.1540828390; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1540835763"),
                  dict(cookie_are="device_id=33a80200aacb73cf594a45942b285a12; _ga=GA1.2.312459015.1529772425; s=ey177hmx06; bid=ae1522508305909e11f0ccaefc21ae37_jn93s7rs; __utmz=1.1539536073.4.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; Hm_lvt_fe218c11eab60b6ab1b6f84fb38bcc4a=1539591917; _gid=GA1.2.758749044.1540657586; aliyungf_tc=AQAAAIe8YFC/zwwAKvJZ2tC9k8DvMt34; __utmc=1; __utma=1.312459015.1529772425.1540825606.1540828390.19; Hm_lvt_1db88642e346389874251b5a1eded6e3=1540825899,1540828382,1540829378,1540829450; snbim_minify=true; __utmt=1; xq_token_expire=Sat%20Nov%2024%202018%2001%3A55%3A26%20GMT%2B0800%20(CST); __utmb=1.52.10.1540828390; _gat_gtag_UA_16079156_4=1; remember=1; remember.sig=K4F3faYzmVuqC0iXIERCQf55g2Y; xq_a_token=b2f21e25cd1817bf15c1c89cc72b25ad537495de; xq_a_token.sig=p4pCAuWXphKrks3IjEzTbJFCcb4; xqat=b2f21e25cd1817bf15c1c89cc72b25ad537495de; xqat.sig=uWTQIYsOCqtgymFewPvkgLk8CyM; xq_r_token=bb8e27cca180872ab70314097a5077578ff119c8; xq_r_token.sig=Q9P70D5S5ZuHuFEXVJ6umTRqL1o; xq_is_login=1; xq_is_login.sig=J3LxgPVPUzbBg3Kee_PquUfih7Q; u=1559188240; u.sig=Ra3Ht4oGmAXu5VtkPBpRXum-Ntc; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1540835848"),
                  dict(cookie_are="device_id=33a80200aacb73cf594a45942b285a12; _ga=GA1.2.312459015.1529772425; s=ey177hmx06; bid=ae1522508305909e11f0ccaefc21ae37_jn93s7rs; __utmz=1.1539536073.4.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; Hm_lvt_fe218c11eab60b6ab1b6f84fb38bcc4a=1539591917; _gid=GA1.2.758749044.1540657586; aliyungf_tc=AQAAAIe8YFC/zwwAKvJZ2tC9k8DvMt34; __utmc=1; __utma=1.312459015.1529772425.1540825606.1540828390.19; Hm_lvt_1db88642e346389874251b5a1eded6e3=1540825899,1540828382,1540829378,1540829450; snbim_minify=true; __utmt=1; remember=1; remember.sig=K4F3faYzmVuqC0iXIERCQf55g2Y; xq_a_token.sig=p4pCAuWXphKrks3IjEzTbJFCcb4; xqat.sig=uWTQIYsOCqtgymFewPvkgLk8CyM; xq_r_token.sig=Q9P70D5S5ZuHuFEXVJ6umTRqL1o; xq_is_login.sig=J3LxgPVPUzbBg3Kee_PquUfih7Q; u.sig=Ra3Ht4oGmAXu5VtkPBpRXum-Ntc; xq_a_token=b70e7188d32f804237b6a42c052b5bcf74ebeea2; xqat=b70e7188d32f804237b6a42c052b5bcf74ebeea2; xq_r_token=b004ebba4649dfef7bba54f6ae7b703e5bca6a61; xq_token_expire=Sat%20Nov%2024%202018%2001%3A58%3A30%20GMT%2B0800%20(CST); xq_is_login=1; u=1497969916; __utmb=1.56.10.1540828390; _gat_gtag_UA_16079156_4=1; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1540835925"),
                  dict(cookie_are="device_id=33a80200aacb73cf594a45942b285a12; _ga=GA1.2.312459015.1529772425; s=ey177hmx06; bid=ae1522508305909e11f0ccaefc21ae37_jn93s7rs; __utmz=1.1539536073.4.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; Hm_lvt_fe218c11eab60b6ab1b6f84fb38bcc4a=1539591917; _gid=GA1.2.758749044.1540657586; __utma=1.312459015.1529772425.1540825606.1540828390.19; xq_token_expire=Sat%20Nov%2024%202018%2001%3A58%3A30%20GMT%2B0800%20(CST); aliyungf_tc=AQAAAAVyoiWa1w4AKvJZ2ozyzTPwnciM; Hm_lvt_1db88642e346389874251b5a1eded6e3=1540829378,1540829450,1540836740,1540866196; remember=1; remember.sig=K4F3faYzmVuqC0iXIERCQf55g2Y; xq_a_token=4458f8df93a013c35835d0320917b19dcaab0a24; xq_a_token.sig=FfAS5LGC_XBO11rmXuA6Nb3o4VI; xqat=4458f8df93a013c35835d0320917b19dcaab0a24; xqat.sig=t2g7eE2UG80Frcg03R-7nudVIBA; xq_r_token=4812b56991883e9913998e8816706912bff911e8; xq_r_token.sig=R6AgMpKf0fhe6GkWdS_etJ0Y3Dw; xq_is_login=1; xq_is_login.sig=J3LxgPVPUzbBg3Kee_PquUfih7Q; u=6146826778; u.sig=h5P6Xki5cmObHzNcRMVufpWUnZc; _gat_gtag_UA_16079156_4=1; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1540866325")]
        s = requests.Session()
        # s.keep_alive = False
        try:
            # NOTE: requests expects cookies as a {name: value} mapping; the dicts above
            # wrap the whole Cookie header string under one key (cookies_are/cookie_are),
            # so only a single oddly-named cookie actually reaches the server
            cookies = random.choice(cookie)
            obj = s.get(url, headers=random.choice(headers), cookies=cookies, stream=True, allow_redirects=False, timeout=20)
            time.sleep(8 + random.random() * 3.2)
            bs = BeautifulSoup(obj.content, 'lxml')
        except requests.exceptions.Timeout:
            print([num, "timeout", "timeout"])
            return [num, "timeout", "timeout"]
        try:
            try:
                # Closed portfolios carry a "cube-closed" banner on the page
                res_current = bs.find_all(attrs={"class": "cube-closed"})[0].get_text()
            except IndexError:
                res_current = "未关停!"  # sentinel: portfolio still active
            res_id = bs.find_all(attrs={"class": "creator fn-clear"})[0].attrs["href"]
            s.close()
            print([num, res_id[1:], res_current])
            return [num, res_id[1:], res_current]
        except IndexError:
            try:
                res_404 = bs.find("title").get_text()
                if res_404 == "404_雪球":
                    s.close()
                    print([num, "NaN", res_404])
                    return [num, "NaN", res_404]
            except AttributeError:
                s.close()
                print([num, "AttributeError", "page_error"])
                return [num, "AttributeError", "page_error"]
    
    result = []
    res_final = []
    res_final.extend(res)
    res_final.extend(res_0)  # res_0: results of an earlier batch, defined elsewhere in the original notebook
    
    
    for i in res_final:
        if i != []:
            result.append(i)
    
    final = []
    for i in result:
        if i[1] != "异常":  # keep rows that did not hit the error sentinel
            final.append(i)
    
    
    except_id = []
    for i in result:
        if i[1] == "异常":  # collect the rows flagged as failed
            except_id.append(i)
            
    
    need = []
    for i in final:
        need.extend(i)
    
  • Scraping and analyzing Xueqiu real-portfolio user data

    2020-08-13 16:03:31

    This post scrapes Xueqiu users who publish real portfolios (including their location data) and analyzes the results.

    Conclusions:
    1. Traversing roughly 20% of Xueqiu users turned up close to 9,000 active real-portfolio users, which extrapolates to around 50,000 real-portfolio users overall.
    2. Of the ~9,000 sampled real-portfolio users, about half have traded recently; the other half have closed their portfolios, though their historical records should still be retrievable.
    3. In the current sample, 1,265 users have an explicit location, and their regional distribution broadly matches the one Xueqiu has published officially.

    Details:
    1. This pass over Xueqiu initially found 8,253 users with real-portfolio data, of which:
    a. Still-active real-portfolio users: 3,808
    b. Portfolios now closed, with only historical rebalancing records kept: 3,111
    c. Users with anomalous site data, to be reprocessed: 1,118
    d. Users hit by connection timeouts, to be reprocessed: 276

    2. Of the 3,808 still-active users, 1,265 have an explicit location, distributed as follows:
    [figure: location distribution of the sampled real-portfolio users]
    3. Below is the user distribution map published by Xueqiu itself:
    [figure: official Xueqiu user distribution map]

    Crawler code:

        cookie = [dict(cookies_are="device_id=06934df365e4a0fdf3e5c1efc4a302fd; __utmz=1.1526174181.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; s=fn15nngiri; _ga=GA1.2.191434752.1526174181; bid=ae1522508305909e11f0ccaefc21ae37_jn8z7lmc; aliyungf_tc=AQAAAEKOuhTXlwQABPNZ2lE9FlkxU221; snbim_minify=true; __utmc=1; __utma=1.191434752.1526174181.1540635588.1540638493.40; Hm_lvt_1db88642e346389874251b5a1eded6e3=1539567244,1539567517,1540450735,1540638620; _gid=GA1.2.1574455490.1540638621; xq_a_token=18b7f7dec4f54032863219716eaf839ee940199d; xqat=18b7f7dec4f54032863219716eaf839ee940199d; xq_r_token=f27bcc9f6c7b6446279ee9448db195b118b8f17c; xq_token_expire=Wed%20Nov%2021%202018%2019%3A41%3A19%20GMT%2B0800%20(CST); xq_is_login=1; u=7147604028; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1540641065; __utmb=1.37.9.1540640362715"),
                  dict(cookie_are="device_id=06934df365e4a0fdf3e5c1efc4a302fd; __utmz=1.1526174181.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; s=fn15nngiri; _ga=GA1.2.191434752.1526174181; bid=ae1522508305909e11f0ccaefc21ae37_jn8z7lmc; aliyungf_tc=AQAAAEKOuhTXlwQABPNZ2lE9FlkxU221; snbim_minify=true; __utmc=1; __utma=1.191434752.1526174181.1540635588.1540638493.40; __utmt=1; Hm_lvt_1db88642e346389874251b5a1eded6e3=1539567244,1539567517,1540450735,1540638620; _gid=GA1.2.1574455490.1540638621; xq_a_token=4458f8df93a013c35835d0320917b19dcaab0a24; xqat=4458f8df93a013c35835d0320917b19dcaab0a24; xq_r_token=4812b56991883e9913998e8816706912bff911e8; xq_is_login=1; u=6146826778; xq_token_expire=Wed%20Nov%2021%202018%2019%3A17%3A51%20GMT%2B0800%20(CST); Hm_lpvt_1db88642e346389874251b5a1eded6e3=1540638965; __utmb=1.8.9.1540638934329"),
                  dict(cookie_are="device_id=06934df365e4a0fdf3e5c1efc4a302fd; __utmz=1.1526174181.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; s=fn15nngiri; _ga=GA1.2.191434752.1526174181; bid=ae1522508305909e11f0ccaefc21ae37_jn8z7lmc; aliyungf_tc=AQAAAEKOuhTXlwQABPNZ2lE9FlkxU221; snbim_minify=true; __utmc=1; __utma=1.191434752.1526174181.1540635588.1540638493.40; Hm_lvt_1db88642e346389874251b5a1eded6e3=1539567244,1539567517,1540450735,1540638620; _gid=GA1.2.1574455490.1540638621; __utmt=1; xq_a_token=b2f21e25cd1817bf15c1c89cc72b25ad537495de; xqat=b2f21e25cd1817bf15c1c89cc72b25ad537495de; xq_r_token=bb8e27cca180872ab70314097a5077578ff119c8; xq_is_login=1; u=1559188240; xq_token_expire=Wed%20Nov%2021%202018%2019%3A24%3A59%20GMT%2B0800%20(CST); Hm_lpvt_1db88642e346389874251b5a1eded6e3=1540639362; __utmb=1.14.9.1540638934329"),
                  dict(cookie_are="device_id=06934df365e4a0fdf3e5c1efc4a302fd; __utmz=1.1526174181.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; s=fn15nngiri; _ga=GA1.2.191434752.1526174181; bid=ae1522508305909e11f0ccaefc21ae37_jn8z7lmc; aliyungf_tc=AQAAAEKOuhTXlwQABPNZ2lE9FlkxU221; snbim_minify=true; __utmc=1; __utma=1.191434752.1526174181.1540635588.1540638493.40; Hm_lvt_1db88642e346389874251b5a1eded6e3=1539567244,1539567517,1540450735,1540638620; _gid=GA1.2.1574455490.1540638621; __utmt=1; xq_a_token=b70e7188d32f804237b6a42c052b5bcf74ebeea2; xqat=b70e7188d32f804237b6a42c052b5bcf74ebeea2; xq_r_token=b004ebba4649dfef7bba54f6ae7b703e5bca6a61; xq_token_expire=Wed%20Nov%2021%202018%2019%3A27%3A29%20GMT%2B0800%20(CST); xq_is_login=1; u=1497969916; _gat_gtag_UA_16079156_4=1; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1540639507; __utmb=1.18.9.1540639426395")]
    
    
    def get_location(num):
        res = []
        import requests
        import random
        import time
        import json
        url = u"https://xueqiu.com/statuses/original/show.json?user_id=" + num[3]
        url_1 = u"https://xueqiu.com/account/oauth/user/show.json?source=sina&userid=" + num[3]
        headers = [{'User-Agent': "Mozilla/5.0 (X11; CrOS x86_64 10066.0.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
                    'Accept': 'text/html;q=0.9,*/*;q=0.8',
                    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                    'Connection': 'close'},
                   {'User-Agent': "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1 (KHTML, like Gecko) CriOS/69.0.3497.100 Mobile/13B143 Safari/601.1.46",
                    'Accept': 'text/html;q=0.9,*/*;q=0.8',
                    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                    'Connection': 'close'},
                   {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A",
                    'Accept': 'text/html;q=0.9,*/*;q=0.8',
                    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                    'Connection': 'close'},
                   {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
                    'Accept': 'application/json, text/plain, */*',
                    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                    'Connection': 'close'},
                   {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
                    'Accept': 'application/json, text/plain, */*',
                    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                    'Connection': 'close'}]
        cookie = [dict(cookies_are=u"device_id=06934df365e4a0fdf3e5c1efc4a302fd; __utmz=1.1526174181.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; s=fn15nngiri; _ga=GA1.2.191434752.1526174181; aliyungf_tc=AQAAADY46BenmA0AefNZ2iIVV7Y6rtgH; __utmc=1; xq_a_token.sig=XglA1uiAYkfyfKlhbuJdRhRTTM4; xq_r_token.sig=jW7KrLgtGYffUvfG3DfPexDR8RQ; xq_a_token=7c41909f4604aa33eb26b7c175f0468a1df2152b; xqat=7c41909f4604aa33eb26b7c175f0468a1df2152b; xq_r_token=b1914a7d50798c67bb7852f09954b82aa41a4a0b; xq_token_expire=Thu%20Nov%2008%202018%2022%3A39%3A32%20GMT%2B0800%20(CST); xq_is_login=1; u=7147604028; bid=ae1522508305909e11f0ccaefc21ae37_jn8z7lmc; snbim_minify=true; Hm_lvt_1db88642e346389874251b5a1eded6e3=1539526506,1539528869,1539567244,1539567517; _gid=GA1.2.1869629804.1540126302; __utma=1.191434752.1526174181.1540126285.1540168048.33; __utmt=1; __utmb=1.31.10.1540168048; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1540171090"),
                  dict(cookie_are=u"device_id=06934df365e4a0fdf3e5c1efc4a302fd; __utmz=1.1526174181.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; s=fn15nngiri; _ga=GA1.2.191434752.1526174181; bid=ae1522508305909e11f0ccaefc21ae37_jn8z7lmc; aliyungf_tc=AQAAAEKOuhTXlwQABPNZ2lE9FlkxU221; snbim_minify=true; __utmc=1; __utma=1.191434752.1526174181.1540635588.1540638493.40; __utmt=1; Hm_lvt_1db88642e346389874251b5a1eded6e3=1539567244,1539567517,1540450735,1540638620; _gid=GA1.2.1574455490.1540638621; xq_a_token=4458f8df93a013c35835d0320917b19dcaab0a24; xqat=4458f8df93a013c35835d0320917b19dcaab0a24; xq_r_token=4812b56991883e9913998e8816706912bff911e8; xq_is_login=1; u=6146826778; xq_token_expire=Wed%20Nov%2021%202018%2019%3A17%3A51%20GMT%2B0800%20(CST); Hm_lpvt_1db88642e346389874251b5a1eded6e3=1540638965; __utmb=1.8.9.1540638934329"),
                  dict(cookie_are="device_id=06934df365e4a0fdf3e5c1efc4a302fd; __utmz=1.1526174181.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; s=fn15nngiri; _ga=GA1.2.191434752.1526174181; bid=ae1522508305909e11f0ccaefc21ae37_jn8z7lmc; aliyungf_tc=AQAAAEKOuhTXlwQABPNZ2lE9FlkxU221; snbim_minify=true; __utmc=1; __utma=1.191434752.1526174181.1540635588.1540638493.40; Hm_lvt_1db88642e346389874251b5a1eded6e3=1539567244,1539567517,1540450735,1540638620; _gid=GA1.2.1574455490.1540638621; __utmt=1; xq_a_token=b2f21e25cd1817bf15c1c89cc72b25ad537495de; xqat=b2f21e25cd1817bf15c1c89cc72b25ad537495de; xq_r_token=bb8e27cca180872ab70314097a5077578ff119c8; xq_is_login=1; u=1559188240; xq_token_expire=Wed%20Nov%2021%202018%2019%3A24%3A59%20GMT%2B0800%20(CST); Hm_lpvt_1db88642e346389874251b5a1eded6e3=1540639362; __utmb=1.14.9.1540638934329"),
                  dict(cookie_are="device_id=06934df365e4a0fdf3e5c1efc4a302fd; __utmz=1.1526174181.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; s=fn15nngiri; _ga=GA1.2.191434752.1526174181; bid=ae1522508305909e11f0ccaefc21ae37_jn8z7lmc; aliyungf_tc=AQAAAEKOuhTXlwQABPNZ2lE9FlkxU221; snbim_minify=true; __utmc=1; __utma=1.191434752.1526174181.1540635588.1540638493.40; Hm_lvt_1db88642e346389874251b5a1eded6e3=1539567244,1539567517,1540450735,1540638620; _gid=GA1.2.1574455490.1540638621; __utmt=1; xq_a_token=b70e7188d32f804237b6a42c052b5bcf74ebeea2; xqat=b70e7188d32f804237b6a42c052b5bcf74ebeea2; xq_r_token=b004ebba4649dfef7bba54f6ae7b703e5bca6a61; xq_token_expire=Wed%20Nov%2021%202018%2019%3A27%3A29%20GMT%2B0800%20(CST); xq_is_login=1; u=1497969916; _gat_gtag_UA_16079156_4=1; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1540639507; __utmb=1.18.9.1540639426395")]
        s = requests.Session()
        s.keep_alive = False
        try:
            obj = s.get(url, headers=random.choice(headers), cookies=random.choice(cookie), stream=True, allow_redirects=False).json()
            time.sleep(random.random() * 16)
            name = obj['user']['screen_name']
            gender = obj['user']['gender']
            province = obj['user']['province']
            city = obj['user']['city']
            followers_count = obj['user']['followers_count']
            friends_count = obj['user']['friends_count']
            status_count = obj['user']['status_count']
            stocks_count = obj['user']['stocks_count']
            res.append(num[2])
            res.append(num[3])
            res.append(name)
            res.append(gender)
            res.append(province)
            res.append(city)
            res.append(followers_count)
            res.append(friends_count)
            res.append(status_count)
            res.append(stocks_count)
            try:
                obj_1 = s.get(url_1, headers=random.choice(headers), cookies=random.choice(cookie), stream=True,
                              allow_redirects=False).json()
                time.sleep(random.random() * 15)
                weibo_uid = obj_1['id']
                res.append(weibo_uid)
                s.close()
                print(res)
                return res
            except (KeyError, json.decoder.JSONDecodeError, IndexError):
                res.append("weibo地址不存在")  # sentinel: no linked Weibo account
                s.close()
                print(res)
                return res
        except json.decoder.JSONDecodeError:
            res.append(num)
            res.append("异常")  # error sentinel
            s.close()
            print(res)
            return res
    
    xueqiu_all = xueqiu_all_data[1:]  # xueqiu_all_data: user rows loaded earlier in the original notebook
    
    if __name__ == "__main__":
        final = []
        for num in xueqiu_all[0::40]:
            try:
                data = get_location(num)
                final.append(data)
            except KeyError:
                print("KeyError")
                pass
    
    # test: the collected user rows (loaded elsewhere); count users whose province is a
    # direct-administered municipality or whose city field holds a real value
    t = 0
    for i in test[1:]:
        if i[5] in ["北京", "上海", "天津"] or i[6] not in ["", "未知", "其他", "异常", "不限", "城市/地区", None]:
            t += 1
            print(i)
    print(t)
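
    To get the regional distribution quoted in the conclusions, a Counter over the province column is enough (a sketch, assuming test keeps the province in column 5 as in the filter above):

    from collections import Counter

    # Tally provinces, ignoring empty/unknown placeholders and the error sentinel
    province_counts = Counter(i[5] for i in test[1:] if i[5] not in ["", "未知", "其他", "异常", None])
    for province, count in province_counts.most_common():
        print(province, count)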
    
  • Scraping Xueqiu news data: two ways of handling the cookie

    Below, two ways of handling the cookie are used to scrape Xueqiu's news data.

    One is handling the cookie manually: capture a request in the page source, find the cookie in its request headers, and copy it into the headers dict.

    The other is handling the cookie automatically: use the Session from the requests module, which sends get and post requests just like requests itself but carries the cookie along automatically with every request.

    So when does scraping need a cookie at all?

    A cookie is how the server keeps track of client-side state. Some sites only serve their main pages after a verification step, and scraping a lot of data would otherwise mean logging in over and over, which is inefficient. The walkthrough below scrapes Xueqiu news headlines both ways to show the difference between the two approaches.

    First check whether the site loads its data via ajax: watching the XHR tab, new requests keep appearing every time we scroll down, so the site does fetch its data dynamically via ajax.

    With that settled, first try a plain crawler of the kind we used before:

    import requests
     
    url="https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id=20369998&count=15&category=-1"
    headers ={"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"}
    page_json  = requests.get(url=url,headers=headers).json()
    print(page_json)

    The output is:

    {'error_description': '遇到错误,请刷新页面或者重新登录帐号后再试', 'error_uri': '/v4/statuses/public_timeline_by_category.json', 'error_data': None, 'error_code': '400016'}
    

    This error comes from the cookie-based anti-scraping check (the message says: hit an error, please refresh the page or log back in and retry).

    Handling the cookie manually

    import requests
    
    url="https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id=20369998&count=15&category=-1"
    headers ={"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
              "Cookie": "aliyungf_tc=AQAAADriOUCilQoAxZ5btPQfYv7152ox; acw_tc=2760824915856669537353368e2ea5d4c1b87e45dadece330ae07e755b96f1; xq_a_token=2ee68b782d6ac072e2a24d81406dd950aacaebe3; xqat=2ee68b782d6ac072e2a24d81406dd950aacaebe3; xq_r_token=f9a2c4e43ce1340d624c8b28e3634941c48f1052; xq_id_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTU4NzUyMjY2MSwiY3RtIjoxNTg1NjY2OTA4NDgwLCJjaWQiOiJkOWQwbjRBWnVwIn0.YCQ_yUlzhRvTiUgz1BWWDFrsmlxSgsbaaKs0cxsdxnOaMhIjF0qUX-5WNeqfRXe15I5cPHiFf-5AzeRZgjy0_bSId2-jycpDWuSIseOY07nHM306A8Y1vSJJx4Q9gFnWx4ETpbdu1VXyMYKpwVIKfmSb5sbGZYyHDJPQQuNTfIAtPBiIeHWPDRB-wtf0qa5FNSMK3LKHRZooXjUgh-IAFtQihUIr9D81tligmjNYREntMY1gLg5Kq6GjgivfF9CFc11sJ11fZxnSw9e8J_Lmx8XXxhwHv-j4-ANUSIuglM4cT6yCsWa3pGAVMN18r2cV72JNkk343I05DevQkbX8_A; u=481585666954081; Hm_lvt_1db88642e346389874251b5a1eded6e3=1585666971; device_id=24700f9f1986800ab4fcc880530dd0ed; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1585667033"}
    page_json  = requests.get(url=url,headers=headers).json()
    print(page_json)
    
    Scraping Xueqiu with a manually copied cookie

    Handling the cookie automatically

    import requests
    session = requests.Session()
    
    url="https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id=20369998&count=15&category=-1"
    headers ={"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"}
    
    # Step 1: request the Xueqiu homepage once so the session picks up the cookie
    session.get(url="https://xueqiu.com",headers=headers)
    # Step 2: fetch the dynamically loaded news data; the cookie rides along automatically
    page_json = session.get(url=url,headers=headers).json()
    print(page_json)
    
    Scraping Xueqiu with session-managed cookies
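
    Either way, the feed sits under page_json['list'], and each item's data field is itself a JSON string (the same layout the MySQL example earlier on this page relies on), so extracting the headlines looks like this sketch:

    import json

    for item in page_json.get('list', []):
        data = json.loads(item['data'])  # 'data' is a JSON string embedded in the JSON response
        print(data.get('id'), data.get('title'))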

     

  • Python crawlers for Xueqiu

    2017-09-26 15:35:20
    A collection of Python crawlers for Xueqiu. (No idea what policy raised the download points this high.)
  • This post implements scraping of Xueqiu stock comments, specifically comments on the CSI 300 constituents. 1. How the crawler works: nothing new here; the basics are covered in my crawlers-from-scratch series, and after the first ten lessons of that Python crawler series the code in this post is completely clear...
  • Scraping stock comments from Xueqiu

    2019-04-10 15:22:54
    Yesterday I got a call from a company I had interviewed with, setting me a preliminary assessment: scrape the stock comments from Xueqiu. I was actually quite excited to get the task, got up the next morning and started scraping; it took about an hour and a half...
  • Crawlers - a stock-crawler example on Xueqiu

    2020-04-12 17:30:08
    Following along with the instructor, but the Baidu stock site from the course is gone, so I substituted Xueqiu. There is no output, and no error either; could someone take a look at what's wrong? Thanks! import re import requests from bs4 import BeautifulSoup def getHTMLText(url,...
  • Scraping Xueqiu

    2018-08-15 22:10:00
    import json import requests import pymysql # the bare request gets blocked, so try adding headers headers = { #'Accept': '*/*', #'Accept-Encoding': 'gzip, deflate, br', #'Accept-Language': 'zh-CN,zh;q=0.9,en;... #'...
  • In the first half of the year my roommate worked with a machine-learning advisor on a stock portfolio project, and over the summer... Here I share the problems I ran into while scraping Xueqiu data, partly as a wrap-up of my own project and partly as a reference for anyone else who needs to scrape Xueqiu data, and...
  • Python crawler: scraping part of Xueqiu's data

    2018-08-15 22:54:55
    import requests import json url = { 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id=-1&count=10&category=111', '...
  • A friend recently asked for help writing a crawler script to pull financial data for some listed companies from Xueqiu. He wanted to choose freely what to grab, so I handed him a simple script, but he still had to set up a Python environment himself and get familiar with tweaking a few parameters, ...
  • A simple requests-based scrape of Xueqiu data to analyze stock trends: import requests import pymongo import json # initialize the database client = pymongo.MongoClient("localhost", 27017) # get the database db = client.gupiao # get the collection stu...
  • Scraping Xueqiu data

    2018-08-16 00:16:00
    import requests import json import pymysql class mysql_conn(object): # constructor def __init__(self): ... self.db = pymysql.connect(host='127.0.0.1',...
  • A friend recently needed help writing a crawler script to pull financial data for some listed companies from Xueqiu. He wanted to be able to choose freely what to...
  • from urllib import request,parse from piaot import * import json import pymysql # yeshu is the number of pages entered def sql(sql_z): # open the database connection db = pymysql.connect("192.168.43.128", "root", ...
  • A stock web spider for xueqiu.com and other stock sites
  • On scraping Xueqiu data

    2018-08-16 08:04:12
    # # request Xueqiu via urllib # response = request.urlopen(req) # # res = response.read() # ## a string; needs converting to a dict/list response=requests.get(url,headers=headers) # res = response.content #print(res...
    For the full code, reply with the keyword '雪球网' to the official account. Account: pythonislover. Remember to set a delay; we are a civilized crawler~~ Oh, and the cookie expires, so refresh it in time. Contributed by: 小爬虫. Reposted from:...
  • The beauty of data: analyzing Xueqiu stock portfolios

    2016-10-13 00:22:24
    The main job is to crawl Xueqiu portfolio data and analyze it to surface promising stock tickers. I haven't tested it with real money; my rough estimate is that at least tracking the portfolios wouldn't lose money. First, what a 'portfolio' is: a stock holding pool managed by an investment curator on platforms like Xueqiu and Weibo...
  • This crawler was exhausting to write, so just a brief note... Xueqiu's stock comments cannot be fetched directly; you must carry the cookie that Xueqiu writes locally on your first visit (just opening the official site once counts as the first visit, and no cookie is needed then). GitHub link first: ...
  • How to back up Xueqiu

    2014-11-11 19:08:21
    Xueqiu is a very valuable site, and the backup tool BlogDown makes backing it up easy. This post explains how to use BlogDown to back up Xueqiu completely, with the detailed steps. 1. Background: Xueqiu is...
  • Site: Xueqiu. Approach: submit a keyword, scrape the search-result page, and store the company records that come back in a database. 1. Searching a company name on Xueqiu returns 3 results, of which search.json?code is the file I want 2. This is one of Xueqiu's APIs, ...
  • Scraping Xueqiu stock data (part 1)

    2019-04-01 15:20:26
    This part scrapes the stock names, using three modules: Crwal_Share_Names, Saved_MongDB, DEFINITION. The simplest module first: DEFINITION holds the url sequence to crawl and the HEADERS; the url comes from the site shown in the figure: ...
  • import json import requests import pymysql class mysql_connect(object): # constructor def __init__(self): self.db = pymysql.connect(host='127.0.0.1', user='root', password='123456', por...
  • # request Xueqiu via urllib response = request.urlopen(req) res = response.read() # bytes; needs converting to a dict/list #print(res) ## conversion res_dict = json.loads(res) res_dict = json.loads(res...
  • Screenshots of the result ... Open the site with devtools showing; as you scroll down, an API request appears (the more you scroll, the more requests show up) ... Comparing two or more of these requests to spot the pattern shows that max_id changes while everything else stays the same... (see the pagination sketch below)
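
    Building on that observation, pagination comes down to feeding the smallest id seen back in as max_id. A minimal sketch (the response layout follows the v4 endpoint shown earlier on this page and is an assumption, not verified against the live API):

    import json
    import requests

    session = requests.Session()
    headers = {"User-Agent": "Mozilla/5.0"}
    session.get("https://xueqiu.com", headers=headers)  # pick up the anti-scraping cookie first

    max_id = -1
    for _ in range(3):  # fetch three pages
        url = ("https://xueqiu.com/v4/statuses/public_timeline_by_category.json"
               "?since_id=-1&max_id=%d&count=15&category=111" % max_id)
        items = session.get(url, headers=headers).json().get("list", [])
        if not items:
            break
        max_id = min(json.loads(item["data"])["id"] for item in items)
        print("fetched %d items, next max_id = %d" % (len(items), max_id))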
