  • Python: both of Dianping's font encryption schemes fully cracked; adapt the code to your own project. PS: scraping reviews requires logging into your own account to obtain cookies
  • Dianping (大众点评) font encryption

    2020-12-30 18:13:48

    Dianping uses 601 encrypted glyphs in total. The scheme is generic: which font encodes a value depends on the type of data; for example, addresses are encoded with the address font and phone numbers with the num font.

    The mapping dictionary:

    {"unif27d": "1", "unie8f9": "2", "unie4a6": "3", "unif22f": "4", "unif510": "5", "unie0fc": "6", "uniebda": "7", "unif21e": "8", "unif2d3": "9", "unie9d6": "0", "unie67d": "店", "unie79a": "中", "unie623": "美", "unie498": "家", "unie902": "馆", "unie835": "小", "unif82e": "车", "unie2d2": "大", "uniebfe": "市", "unie161": "公", "unif4ea": "酒", "unif814": "行", "unie38b": "国", "uniec5c": "品", "unie9ab": "发", "unif766": "电", "unif686": "金", "unif338": "心", "unie8de": "业", "uniee2a": "商", "unie120": "司", "unie808": "超", "unie977": "生", "unie931": "装", "unieac1": "园", "uniebf9": "场", "unie7d3": "食", "unie14a": "有", "unif60e": "新", "unie3c9": "限", "unie7d4": "天", "unie228": "面", "uniebca": "工", "unif4b2": "服", "uniefdd": "海", "uniee34": "华", "unie96d": "水", "unie7a8": "房", "unif2a3": "饰", "uniefc1": "城", "unie3ff": "乐", "unieabf": "汽", "uniea54": "香", "unie9b5": "部", "unie84c": "利", "uniee20": "子", "unieba6": "老", "unif54f": "艺", "unied5a": "花", "unie97a": "专", "unif534": "东", "unif2fa": "肉", "unif17d": "菜", "uniea71": "学", "unif1cc": "福", "unie356": "饭", "unif426": "人", "unif4d8": "百", "unif6e6": "餐", "unif068": "茶", "unie5a7": "务", "unif32a": "通", "unif5c6": "味", "unie040": "所", "unie58a": "山", "unif2f3": "区", "unieecd": "门", "unie141": "药", "unif7a2": "银", "unif51f": "农", "unie4f7": "龙", "uniec49": "停", "unif8ba": "尚", "unif82f": "安", "unif28e": "广", "unie6d9": "鑫", "unie08e": "一", "unif466": "容", "unieb7c": "动", "unie1d4": "南", "unif19d": "具", "unif8d0": "源", "unie8c7": "兴", "uniec26": "鲜", "uniee23": "记", "uniec7c": "时", "unif3f2": "机", "unie6d7": "烤", "unie639": "文", "unif51e": "康", "unif5fc": "信", "unif683": "果", "unif134": "阳", "unie658": "理", "unif1ed": "锅", "unif5b5": "宝", "unie5e0": "达", "uniee8b": "地", "unif4b0": "儿", "unif8cb": "衣", "unie13a": "特", "unif5e8": "产", "unie38e": "西", "unie159": "批", "uniedb6": "坊", "unif49b": "州", "unie2a6": "牛", "unif61f": "佳", "unif8b4": "化", "uniec6c": "五", "unie266": "米", "unif798": "修", "unif7e0": "爱", "unif368": "北", "unif212": "养", "unie82a": "卖", "unieed3": "建", "unif32e": "材", "unie55a": "三", "unif733": "会", "unif25d": "鸡", "unif6a5": "室", "unieb44": "红", "unif8f7": "站", "uniefa4": "德", "unie643": "王", "unie6bc": "光", "unie03b": "名", "uniefd6": "丽", "unie048": "油", "unie645": "院", "unied4a": "堂", "unif7f5": "烧", "unie37e": "江", "unief11": "社", "unie29a": "合", "unie7b9": "星", "uniea5f": "货", "unied39": "型", "unie117": "村", "unief89": "自", "unie006": "科", "uniee36": "快", "unif3dc": "便", "unif189": "日", "unif43d": "民", "unif692": "营", "unif16e": "和", "unif08d": "活", "uniebf3": "童", "unie504": "明", "unie75f": "器", "unie6cd": "烟", "unif71b": "育", "unif1fa": "宾", "unif883": "精", "unieca4": "屋", "unif40d": "经", "unieba1": "居", "unif602": "庄", "unif6a4": "石", "uniea02": "顺", "unieb9e": "林", "unief0d": "尔", "unif1ec": "县", "uniea64": "手", "unied7a": "厅", "unieed0": "销", "unie1e3": "用", "unied36": "好", "unie83c": "客", "unif723": "火", "unied18": "雅", "unif122": "盛", "unied07": "体", "unie5ad": "旅", "uniefe9": "之", "unif68c": "鞋", "unieddb": "辣", "unie890": "作", "unie0dc": "粉", "unif18c": "包", "unif183": "楼", "uniec48": "校", "unif471": "鱼", "unie7a4": "平", "unif306": "彩", "uniedb4": "上", "unif64e": "吧", "unie410": "保", "unif417": "永", "uniec00": "万", "uniebc5": "物", "unif078": "教", "unif2bf": "吃", "unie65b": "设", "unieeb7": "医", "unieccd": "正", "unif378": "造", "unif8f4": "丰", "unif035": "健", "unied29": "点", "unie36c": "汤", "unif132": "网", "unied9b": "庆", "unif293": "技", "unie61c": "斯", "unif5a8": "洗", "uniecd5": "料", "unif109": "配", "unie959": "汇", "unif18e": 
"木", "unieef9": "缘", "unie163": "加", "unif886": "麻", "unie869": "联", "uniec1e": "卫", "unif8b5": "川", "unieda8": "泰", "unie324": "色", "unie3cc": "世", "unie261": "方", "unie3a0": "寓", "unif324": "风", "uniee1a": "幼", "unie934": "羊", "unie8ee": "烫", "unif752": "来", "unieb1d": "高", "unief48": "厂", "unif23b": "兰", "unif5bf": "阿", "unie04f": "贝", "unif045": "皮", "uniedff": "全", "unif166": "女", "unie82d": "拉", "unif58b": "成", "unif344": "云", "unif824": "维", "unie1f4": "贸", "unie21c": "道", "unif7d0": "术", "uniecc4": "运", "unif236": "都", "unie1f9": "口", "unif1d8": "博", "unif7ce": "河", "uniefcc": "瑞", "unie2c1": "宏", "unif270": "京", "unif49a": "际", "unif581": "路", "unif62c": "祥", "unif7e6": "青", "unie108": "镇", "unief01": "厨", "unie431": "培", "unieceb": "力", "unie5eb": "惠", "unif34e": "连", "unie07e": "马", "unie0d2": "鸿", "unie18e": "钢", "unie257": "训", "unie36a": "影", "unie7c5": "甲", "unie0f1": "助", "unie6d1": "窗", "unie136": "布", "unif864": "富", "unied19": "牌", "unif64f": "头", "uniecaa": "四", "unif0f7": "多", "unief57": "妆", "unie178": "吉", "unie6ce": "苑", "unif603": "沙", "unie4fb": "恒", "unie09f": "隆", "unieb0d": "春", "unif739": "干", "uniee46": "饼", "unif211": "氏", "unif562": "里", "unif29d": "二", "unie2bb": "管", "uniea72": "诚", "unie9d4": "制", "unif519": "售", "unie9b3": "嘉", "unif35d": "长", "unif2a6": "轩", "unieafe": "杂", "unie01e": "副", "unif219": "清", "unif770": "计", "unif095": "黄", "unie282": "讯", "unif1e8": "太", "unie7b7": "鸭", "unieb83": "号", "unif1f0": "街", "unie192": "交", "unie236": "与", "unie8ad": "叉", "uniefd9": "附", "unif504": "近", "unif0f5": "层", "unieadd": "旁", "unif1d1": "对", "uniec75": "巷", "uniea1e": "栋", "unif461": "环", "unie5f5": "省", "unif22e": "桥", "unie0d3": "段", "unie837": "乡", "unif420": "厦", "unie64d": "府", "unie9d1": "于", "unie229": "铺", "unie003": "内", "unie06a": "侧", "unie133": "元", "unie933": "购", "unie647": "前", "uniee14": "幢", "unie089": "滨", "unie4ad": "处", "unif26b": "向", "unif2e7": "座", "unieb1b": "下", "unif828": "鼎", "unie923": "凤", "unif2cf": "港", "unif87e": "开", "uniea48": "关", "unie524": "景", "unied41": "泉", "unif33c": "塘", "unie295": "放", "unif6ff": "昌", "unie88f": "线", "unif3b5": "湾", "unief5f": "政", "unie09a": "步", "unif480": "宁", "unif020": "解", "unif832": "白", "unie73f": "田", "unif0d8": "町", "unie8ae": "溪", "unif769": "十", "uniec6e": "八", "unie08a": "古", "unif431": "双", "unif494": "胜", "unie5de": "本", "unie179": "单", "unie069": "同", "unif19b": "九", "unif20e": "迎", "unif1b1": "第", "uniee9f": "台", "unie2d3": "玉", "unif810": "锦", "unie7ab": "底", "uniecce": "后", "unie05d": "七", "unieaa9": "斜", "unie7cb": "期", "unif2e3": "武", "unie242": "岭", "unif5c8": "松", "uniee7a": "角", "uniee60": "纪", "unif545": "朝", "unif71d": "峰", "unie794": "六", "unie376": "振", "unif42b": "珠", "unie708": "局", "unie8a0": "岗", "unif5be": "洲", "unie87e": "横", "uniea15": "边", "unif0f1": "济", "unif1da": "井", "unie9bc": "办", "unie194": "汉", "unieb1e": "代", "unif1f8": "临", "unied3f": "弄", "unif124": "团", "unie9b4": "外", "unie7e0": "塔", "unif20d": "杨", "unif7aa": "铁", "unif24f": "浦", "uniea8c": "字", "unie143": "年", "unieb02": "岛", "uniea9b": "陵", "unif221": "原", "unie85e": "梅", "unieb88": "进", "unif5f4": "荣", "unie730": "友", "uniefe7": "虹", "uniec16": "央", "unie885": "桂", "unie95f": "沿", "unied1e": "事", "unif738": "津", "unie14b": "凯", "unie7c7": "莲", "unie618": "丁", "unie361": "秀", "unif28a": "柳", "unif5f8": "集", "uniebba": "紫", "unif538": "旗", "unie4fe": "张", "unie460": "谷", "uniedb2": "的", "unif7b7": "是", "unie00b": "不", "unif675": "了", "unieebf": "很", "unif2d0": "还", "uniea14": "个", "unie398": 
"也", "uniea17": "这", "unif02a": "我", "unieb35": "就", "unif4de": "在", "unie1cc": "以", "unie49f": "可", "unied1f": "到", "unie715": "错", "unif6e1": "没", "unie510": "去", "uniee04": "过", "unif5f7": "感", "unie05f": "次", "unied55": "要", "unief15": "比", "unif7a6": "觉", "uniefde": "看", "uniee9e": "得", "unie948": "说", "unie2ac": "常", "unie268": "真", "unie628": "们", "unif87f": "但", "unif6fc": "最", "unie3e9": "喜", "unie70e": "哈", "unif5d3": "么", "unif388": "别", "unie96f": "位", "unif69a": "能", "unie893": "较", "unie0f2": "境", "unie088": "非", "unif85f": "为", "uniee6c": "欢", "unif805": "然", "unie657": "他", "unie720": "挺", "unie4d9": "着", "unif179": "价", "uniec60": "那", "unie188": "意", "unif204": "种", "unif264": "想", "unif4a9": "出", "uniebbe": "员", "uniecf8": "两", "uniefd4": "推", "unie77d": "做", "unie033": "排", "unie70d": "实", "uniee41": "分", "unieff5": "间", "uniecde": "甜", "unif401": "度", "unie65e": "起", "unif835": "满", "unif3e7": "给", "unieea9": "热", "unif286": "完", "unif39c": "格", "unie355": "荐", "unie04b": "喝", "unif5fa": "等", "unif04b": "其", "uniece1": "再", "unif323": "几", "unieb67": "只", "unie7ef": "现", "unie251": "朋", "unieef2": "候", "unif88d": "样", "unieffd": "直", "unif133": "而", "unied88": "买", "unie3af": "于", "unie7bd": "般", "uniefd3": "豆", "unif2af": "量", "unie974": "选", "unie1d7": "奶", "unie126": "打", "unie47e": "每", "unie945": "评", "unie759": "少", "uniec63": "算", "unie4aa": "又", "uniec7d": "因", "unie183": "情", "unieb91": "找", "unief4a": "些", "unif54e": "份", "unif66f": "置", "unie57b": "适", "unieb2a": "什", "unif123": "蛋", "unif590": "师", "unif0d3": "气", "unie19f": "你", "uniedb0": "姐", "unieb8d": "棒", "unie76b": "试", "unie6b7": "总", "unie48d": "定", "unif3a0": "啊", "unief20": "足", "unief39": "级", "unif37d": "整", "unie8e7": "带", "unif81e": "虾", "unif638": "如", "unif0b6": "态", "uniefff": "且", "unif173": "尝", "unie004": "主", "unif155": "话", "unie28e": "强", "unif643": "当", "unie053": "更", "unie4b5": "板", "unie613": "知", "unie9c9": "己", "unie624": "无", "uniead9": "酸", "uniecfe": "让", "unie975": "入", "unif0f2": "啦", "unie4a0": "式", "unif02d": "笑", "unif3e5": "赞", "unif001": "片", "uniee2d": "酱", "unif40b": "差", "unie508": "像", "unie690": "提", "unie0c4": "队", "unie96b": "走", "unief9b": "嫩", "unif4ee": "才", "unieee5": "刚", "unieed7": "午", "unif280": "接", "uniec0d": "重", "uniebd4": "串", "unie608": "回", "unif030": "晚", "unie6e8": "微", "unie27b": "周", "unie7fc": "值", "unif082": "费", "unie14d": "性", "unie7db": "桌", "uniec36": "拍", "uniee7b": "跟", "uniec98": "块", "unief12": "调", "unie8ea": "糕"}

    This is the mapping dictionary. The encoding differs from font to font, but the character set is always the same: 601 characters in total.

    The order of the characters never changes either, so the first step is to fetch the encrypted font file.

    # -*- coding:UTF-8 -*-
    import requests
    from fontTools.ttLib import TTFont
    import json

    url = 'http://s3plus.meituan.net/v1/mss_73a511b8f91f43d0bdae92584ea6330b/font/4eda5444.woff'
    response = requests.get(url=url)
    with open('b.woff', 'wb') as f:
        f.write(response.content)

    font = TTFont("./b.woff")
    # get the order of the encrypted glyph names
    font_names = font.getGlyphOrder()[2:]

    # build a new mapping dictionary from the one saved earlier
    with open('font.json', 'r', encoding='utf8') as f:
        json_dict = json.loads(f.read())
    json_values = json_dict.values()
    json_dict2 = {}
    for index, json_value in enumerate(json_values):
        json_dict2[font_names[index]] = json_value
    print(json_dict2)
    with open('hours.json', 'w', encoding='utf8') as f:
        f.write(json.dumps(json_dict2, ensure_ascii=False))

    Generate the new mapping dictionary this way, then replace the encoded characters in the page:

    # -*- coding:UTF-8 -*-
    import json
    import re

    import pandas
    import requests
    import urllib3
    from pyquery import PyQuery as pq
    from retry import retry


    class DianPing:

        def __init__(self):
            self.start_url = "http://www.dianping.com/zhengzhou/ch45/g150"
            self.headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
            }
            self.login_header = {
                'Proxy-Connection': 'keep-alive',
                'Cache-Control': 'max-age=0',
                'Upgrade-Insecure-Requests': '1',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
                'Referer': 'http://www.dianping.com/zhengzhou/ch45/g33844r7462',
                'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            }
            # tunnel proxy credentials
            tunnel = "tps198.kdlapi.com:15818"
            username = "t10886694756492"
            password = "bjgfg7jn"
            self.proxies = {
                "http": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": tunnel},
                "https": "https://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": tunnel}
            }
            self.all_list = []

        def get_font_dict(self):
            # load the three pre-built mapping dictionaries (address, business hours, numbers)
            path = "address_font.json"
            with open(path, 'r', encoding='utf8') as f:
                self.address_font = json.loads(f.read())
            business_hours_path = 'business_hours.json'
            with open(business_hours_path, 'r', encoding='utf8') as f:
                self.business_hours_font = json.loads(f.read())
            hours_path = 'hours.json'
            with open(hours_path, 'r', encoding='utf8') as f:
                self.hours_font = json.loads(f.read())

        def get_area(self):
            # collect the district links from the category landing page
            response = requests.get(url=self.start_url, headers=self.headers, proxies=self.proxies)
            doc = pq(response.content.decode())
            area_list = doc('#region-nav a').items()
            for area in area_list:
                area_url = area.attr('href')
                area_name = area('span').text()
                self.get_page_content(area_name, area_url)

        def get_page_content(self, area_name, area_url):
            # walk the shop list of one district, following the "next page" link recursively
            response = requests.get(url=area_url, headers=self.headers, proxies=self.proxies)
            doc = pq(response.content.decode())
            ul = doc('#shop-all-list li').items()
            for li in ul:
                title = li('.tit h4').text()
                title_url = li('.tit a').attr('href')
                if title_url:
                    self.get_store_info(area_name, title, title_url)
            next = doc('.next')
            if next:
                next_url = next.attr('href')
                self.get_page_content(area_name, next_url)

        @retry()
        def get_store_info(self, area, title, title_url):
            store_dict = {}
            store_dict['名称'] = title
            store_dict['行政区'] = area
            urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
            response = requests.get(url=title_url, headers=self.login_header, verify=False, proxies=self.proxies)
            response_page = response.content.decode()
            # address: swap every encoded glyph for the plain character from the mapping
            # (the HTML-tag portions of the regex patterns below did not survive in the original post)
            address_info = re.findall('(.*?)', response_page)
            if address_info:
                address_info = address_info[0]
                result_address = re.findall('(.*?)', address_info)
                for info in result_address:
                    info_key = info.replace('&#x', 'uni').replace(';', '')
                    info_value = self.address_font[info_key]
                    address_info = re.sub(f'{info}', info_value, address_info)
                result_num = re.findall('(.*?)', address_info)
                for num in result_num:
                    num_key = num.replace('&#x', 'uni').replace(';', '')
                    num_value = self.address_font[num_key]
                    address_info = re.sub(f'{num}', num_value, address_info)
                address_doc = pq(address_info).text()
                store_dict['地址'] = address_doc
            # phone number: same substitution, using the num font mapping
            phone = re.findall('(.*?)', response_page)
            if phone:
                phone_info = re.sub('电话:', '', phone[0])
                result_num = re.findall('(.*?)', phone_info)
                for num in result_num:
                    num_key = num.replace('&#x', 'uni').replace(';', '')
                    num_value = self.address_font[num_key]
                    phone_info = re.sub(f'{num}', num_value, phone_info)
                phone_num = pq(phone_info).text()
                store_dict['电话'] = phone_num
            # business hours: two different fonts can appear in the same block
            business_hours_html = re.findall('(.*?)', response_page)
            if business_hours_html:
                business_hours_html = business_hours_html[0]
                svgmtsi = re.findall('(.*?)', business_hours_html)
                for svg in svgmtsi:
                    svg_key = svg.replace('&#x', 'uni').replace(';', '')
                    svg_value = self.business_hours_font[svg_key]
                    business_hours_html = re.sub(f'{svg}', svg_value, business_hours_html)
                result_num = re.findall('(.*?)', business_hours_html)
                for num in result_num:
                    num_key = num.replace('&#x', 'uni').replace(';', '')
                    num_value = self.hours_font[num_key]
                    business_hours_html = re.sub(f'{num}', num_value, business_hours_html)
                business_hours = pq(business_hours_html).text().replace('修改', '').replace('营业时间: ', '')
                store_dict['营业时间'] = business_hours
            print(store_dict)
            self.all_list.append(store_dict)

        def run(self):
            self.get_font_dict()
            self.get_area()
            pandas.DataFrame(self.all_list).to_excel('体育场馆.xlsx', index=False)


    if __name__ == '__main__':
        DianPing().run()

    For the cookie, you only need to be logged in; there is no IP restriction.
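
    For review pages specifically (per the PS at the top), a login cookie has to be sent with each request. A minimal sketch with requests, where the cookie string and shop id are placeholders you would copy from your own logged-in browser, might look like this:

    import requests

    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'Cookie': '<copied from your logged-in browser via DevTools>',  # placeholder
    })
    response = session.get('http://www.dianping.com/shop/xxxx/review_all')  # xxxx: placeholder shop id
    print(response.status_code)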

  • A first look at font encryption (字体加密初认识)

    A first look at font encryption

    Quite a few websites use custom font files to obfuscate their data, so the data in the page source differs from the data that is displayed.

    The effect is similar to the one in yesterday's post about the X薯 Chinese-novel site, but the underlying principle is very different.

    Python爬虫进阶必备 | X薯中文网加密分析

    On sites that use font encryption, users cannot simply copy the page content either.

    Sites currently known to use font encryption include roughly the following:

    58同城, 起点, 猫眼, 大众点评, 启信宝, 天眼查, 实习僧, 汽车之家
    

    Since so many sites have adopted font encryption, it must be an effective anti-scraping measure. As crawler engineers, how should we respond?

    First, let's get clear on what font encryption actually is.

    What is font encryption?

    A web font is a collection of glyphs, where each glyph is a vector shape describing a letter or symbol.

    The size of a given font file is therefore determined by two simple variables: the complexity of each glyph's vector paths and the number of glyphs in the font.

    Put plainly, for the same content the glyphs of a web font should always look much the same, so we can compare the glyphs of different font files to confirm what each encoded character maps to.
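
    As a concrete illustration of "comparing glyphs", here is a minimal sketch (file names are hypothetical, and it assumes TrueType-flavoured woff files with a glyf table): it takes a glyph whose meaning is already known in an old font and searches a newly downloaded font for the glyph with the identical outline.

    from fontTools.ttLib import TTFont

    base = TTFont('base.woff')   # an older font whose glyphs you have already labelled
    new = TTFont('new.woff')     # a freshly downloaded font with reshuffled code points

    def outline(font, name):
        # coordinates plus flags fully describe the shape of a glyph
        coords, _end_pts, flags = font['glyf'][name].getCoordinates(font['glyf'])
        return list(coords), list(flags)

    known = 'unif27d'                        # assume this glyph was labelled "1" in base.woff
    target = outline(base, known)
    for name in new.getGlyphOrder()[2:]:     # skip the leading placeholder glyphs
        if outline(new, name) == target:
            print(name, 'in new.woff has the same shape as', known, '-> "1"')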

    There are plenty of articles describing font encryption; my personal suggestion is to read Google's official article on web fonts.

    https://developers.google.com/web/fundamentals/performance/optimizing-content-efficiency/webfont-optimization?hl=zh-cn

    Next is a diagram of how the web-font encryption mapping works; it comes from 谷雨解字:

    https://guyujiezi.com/

    When a crawler fetches the page, the code in the page is the ciphertext, while the human eye sees the plaintext. This mapping is what keeps crawlers from scraping the site's content directly.

    How do we handle font encryption?

    The diagram above gives us a rough idea of how font encryption works.

    I recommend that anyone who has never dealt with font encryption pick a fairly simple site to practice on; the example written up most often online is Maoyan's professional edition.

    There are a great many articles on decrypting Maoyan's fonts, so if you have never touched this, try it yourself first; 咸鱼 will also publish a font-decryption series later on.

    Here is the rough workflow for font decryption:

    • First locate the font file; in the page source it will be a file named something like xxx.ttf

    • Repeat that step and save two copies of the font file

    • Open them with the software or websites below, and use Python fontTools to parse the ttf file into an xml file (see the sketch right after this list)

    • From the xml parsed out of the font file, together with a glyph view like the one above, work out the mapping pattern for identical content (this is the key step)

    • Implement the pattern you found in Python, so your code can use it to restore the mapping between the source code and the displayed content (this sentence is rather abstract; read it again later alongside a code write-up)
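
    For the fontTools step in the list above, a minimal sketch (the .woff path is a placeholder) might be:

    from fontTools.ttLib import TTFont

    font = TTFont('xxx.woff')      # the font file saved from the page
    font.saveXML('xxx.xml')        # dump the whole font to XML for manual inspection

    # cmap maps each character code used in the page to a glyph name,
    # e.g. 0xe602 -> 'num_' or 0xf27d -> 'unif27d'
    for code, glyph_name in font.getBestCmap().items():
        print(hex(code), glyph_name)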

    Font decryption resources

    Here are the resource links.

    The first is FontCreator, a font-viewing application for Windows.

    Link: https://pan.baidu.com/s/1tUznnSB3siI2rVY9Whv88A  Password: ygz9

    The second is a pair of websites for opening fonts, suitable for macOS users.

    First convert the font file to svg with cloudconvert, then open and view it with fontello.

    http://fontello.com/

    https://cloudconvert.com/ttf-to-svg

    If that conversion feels like too much trouble, you can open the font with Baidu's font editor instead.

    http://fontstore.baidu.com/static/editor/index.html

    I recommend FontCreator and the Baidu font editor.

    Opened in any of these tools, the font is displayed as a grid of glyphs.

    EOF

  • A Python crawler case study: cracking font encryption

    A Python crawler case study: cracking font encryption

    This case study scrapes the 起小点 (Qidian) novel site as the example.

    Goal:

    By scraping the titles and monthly-ticket counts on the monthly-ticket ranking, show how to defeat the font-encryption anti-scraping measure and turn the encrypted data back into plaintext.

    What the program does:

    You enter the number of pages to scrape and get the novel title and monthly-ticket count for every entry on each page.

    Analysis:

    Find the target URL.

    (Right-click, Inspect) Locate the node that holds the novel title, and from it derive the XPath expression for the titles.

    (Right-click, Inspect) Locate the monthly-ticket count: inspecting its text shows a string of encrypted data.

    Debugging with XPath Helper shows that no XPath expression reaches the encrypted data, so it has to be extracted with a regular expression instead (the pattern is the one used in the code below); what it returns is the encrypted data.

    Cracking this encrypted data is the key step of the case.

    Since the data is encrypted, there must be a font file that carries the encryption rules. Find the URL of that font file in the page, send a request, and save the response to get the woff file for the encrypted data.

    Note: the woff file we need has the same name as the class attribute that precedes the encrypted monthly-ticket value.

    Download the woff file and find which English number word each hexadecimal code corresponds to.

    Then use the third-party fontTools TTFont class to convert the hexadecimal code points in the file to decimal, and the English number words to Arabic digits. That resolves each piece of encrypted data to the digits of its monthly-ticket count.

    Note:

    Because the encrypted data obtained by the regular expression above carries special symbols, after resolving the digits we still have to strip those symbols and then concatenate the digits to get the final ticket count.
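
    In other words, once the mapping table is built, decoding one encrypted value is just: keep the digits of each decimal code point, look each one up, and join the results. A tiny sketch with made-up code points:

    import re

    # hypothetical mapping produced from getBestCmap() after the word-to-digit conversion
    cmap_digits = {100298: '4', 100299: '7', 100300: '2'}

    raw = '&#100298;&#100299;&#100300;'        # one encrypted monthly-ticket value from the regex
    codes = re.findall(r'\d+', raw)             # strip the &# and ; -> ['100298', '100299', '100300']
    print(''.join(cmap_digits[int(c)] for c in codes))   # -> '472'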

    Finally, compare the URLs of different pages to find the pagination rule: across three different URLs, the only thing that changes is the page parameter.

    With the analysis done, on to the code:

    import requests
    from lxml import etree
    import re
    from fontTools.ttLib import TTFont
    import json
    
    if __name__ == '__main__':
        # number of pages to scrape
        pages = int(input('请输入要爬取的页数:'))  # e.g. pages = 1, 2
        for i in range(pages):  # i = 0, (0, 1)
            page = i+1   # 1, (1, 2)
            # target url
            url_ = f'https://www.qidian.com/rank/yuepiao?page={page}'
            # request headers
            headers = {
                'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
            }
            # send the request and get the response
            response_ = requests.get(url_,headers=headers)
            # the response is html text
            str_data = response_.text
            # parse the html text into an element tree
            py_data = etree.HTML(str_data)
            # extract the novel titles
            title_list = py_data.xpath('//h4/a[@target="_blank"]/text() ')
            # the monthly-ticket counts cannot be reached with xpath, so use a regex against response_.text
            mon_list = re.findall('</style><span class=".*?">(.*?)</span></span>',str_data)
            print(mon_list)
            # get the url of the anti-scraping woff file (xpath combined with a regex)
            fonturl_str = py_data.xpath('//p/span/style/text()')
            font_url = re.findall(r"format\('eot'\); src: url\('(.*?)'\) format\('woff'\)",str_data)[0]
            print(font_url)
            # with the url in hand, build headers and request it
            headers_ = {
                'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
                'Referer':'https://www.qidian.com/'
            }
            # send the request and get the response
            font_response = requests.get(font_url,headers=headers_)
            # binary file, so use .content
            font_data = font_response.content
            # save it locally
            with open('加密font文件.woff','wb')as f:
                f.write(font_data)
            # parse the encrypted font file
            font_obj = TTFont('加密font文件.woff')
            # dump it to a readable xml file
            font_obj.saveXML('加密font文件.xml')
            # get the font mapping table; keys are decimal code points
            cmap_list = font_obj.getBestCmap()
            print('字体加密关系映射表:',cmap_list)
            # dictionary mapping English number words to Arabic digits
            dict_e_a = {'one':'1','two':'2','three':'3','four':'4','five':'5','six':'6',
                        'seven':'7','eight':'8','nine':'9','zero':'0'}
            # convert the English words
            for i in cmap_list:
                for j in dict_e_a:
                    if j == cmap_list[i]:
                        cmap_list[i] = dict_e_a[j]
            print('转换为阿拉伯数字的映射表为:',cmap_list)
            # strip the symbols from the encrypted monthly-ticket strings
            new_mon_list = []
            for i in mon_list:
                list_ = re.findall(r'\d+',i)
                new_mon_list.append(list_)
            print('去掉符号之后的月票数据列表为:',new_mon_list)
            # resolve the monthly-ticket digits
            for i in new_mon_list:
                for j in enumerate(i):
                    for k in cmap_list:
                        if j[1] == str(k):
                            i[j[0]] = cmap_list[k]
            print('解析之后的月票数据为:',new_mon_list)
            # join the digits
            new_list = []
            for i in new_mon_list:
                j = ''.join(i)
                new_list.append(j)
            print('解析出的明文数据为:',new_list)
            # pair each title with its ticket count, convert to json, and save
            for i in range(len(title_list)):
                dict_ = {}
                dict_[title_list[i]] = new_list[i]
                # convert the dict to json
                json_data = json.dumps(dict_,ensure_ascii=False)+',\n'
                # append to the local file
                with open('翻页起小点月票榜数据爬取.json','a',encoding='utf-8')as f:
                    f.write(json_data)
    

    Two pages of data were scraped, with 20 entries per page; the results are appended to the json file, one novel per line.

  • Douyin (抖音) font encryption

    2020-03-28 11:15:07

    Background:

    The day before yesterday I took on a small job. The client opened by asking whether Douyin follower counts could be scraped. It looked like font encryption to me, but I had no idea how often the font file is updated; if it were like Maoyan's font anti-scraping, where the font changes on every refresh, that would be miserable, so I stalled a bit to see whether it updates (fine, I admit I had never worked on Douyin's font encryption before, so I had not analyzed it). In the meantime the client left. Left. Gone. Oh well, I analyzed it afterwards anyway and found that Douyin's encrypted font file does not change, which makes this easy work, a matter of minutes.

    Analysis:
    The font file:

    The first step, naturally, is to find where the encrypted font is. Press F12 to open the developer tools.
    On the profile page the follower count we want renders as little squares, and in the page source it appears as unicode escapes, so we cannot read the count directly. The reason is that Douyin uses its own set of font-encryption rules; once we find those rules we can recover the data.

    Getting the font file:

    From there we can get to where the fonts are loaded, and the font file we want is right there. Font files usually have a ttf or woff extension; find the woff one, which can be downloaded directly in the browser, and then it is time to write some code.

    I will not go over how to fetch the page information here; this post is mainly about analyzing the font encryption.

    Getting the font mapping rules:

    Here we use fontTools; with this library we can work with font files from Python.

    Installation:

    pip install fontTools

    Usage:

    https://blog.csdn.net/weixin_43411585/article/details/103484643

    Analyzing the mapping rules:

    In the saved output, find the file with the xml extension, open it, and locate the cmap table (cmap is the table that maps characters to glyphs).
    There we can see the one-to-one correspondence. So what are these num_ entries? They are Douyin's font mapping, and with a couple of tools we can recover the rules behind them.
    Simple, isn't it? Of course this is the point-and-click way of doing it; later I will post a method that works even if the font changes every day. The tools are covered in the usage link above, together with installation instructions, so I will not repeat them here.
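
    The hard-coded mapping table in the code below could also be regenerated from the font's cmap; a minimal sketch, assuming the woff has already been downloaded into the current directory:

    from fontTools.ttLib import TTFont

    font = TTFont('iconfont_9eb9a50.woff')
    # cmap: code point used in the HTML entity -> glyph name, e.g. 0xe602 -> 'num_'
    for code, name in font.getBestCmap().items():
        print(f'&#x{code:x}; -> {name}')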

    The code:
    # -*- coding: utf-8 -*-
    # @Time    : 2020/3/27 23:43
    # @Author  : Key-lei
    # @File    : 抖音粉丝2020.3.27.py
    import requests
    from fontTools.ttLib import TTFont
    from parsel import Selector
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    # the font file found during the analysis above
    url = 'https://s3.pstatp.com/ies/resource/falcon/douyin_falcon/static/font/iconfont_9eb9a50.woff'
    
    # these mappings were found in the xml dumped into the current directory
    real_2_map = {
        'x': '',
        'num_': '1', 'num_1': '0', 'num_2': '3', 'num_3': '2', 'num_4': '4', 'num_5': '5', 'num_6': '6', 'num_7': '9',
        'num_8': '7',
        'num_9': '8'
    }
    
    two_2_sixteenn = {
        "&#xe602": "num_",
        "&#xe603": "num_1",
        "&#xe604": "num_2",
        "&#xe605": "num_3",
        "&#xe606": "num_4",
        "&#xe607": "num_5",
        "&#xe608": "num_6",
        "&#xe609": "num_7",
        "&#xe60a": "num_8",
        "&#xe60b": "num_9",
        "&#xe60c": "num_4",
        "&#xe60d": "num_1",
        "&#xe60e": "num_",
        "&#xe60f": "num_5",
        "&#xe610": "num_3",
        "&#xe611": "num_2",
        "&#xe612": "num_6",
        "&#xe613": "num_8",
        "&#xe614": "num_9",
        "&#xe615": "num_7",
        "&#xe616": "num_1",
        "&#xe617": "num_3",
        "&#xe618": "num_",
        "&#xe619": "num_4",
        "&#xe61a": "num_2",
        "&#xe61b": "num_5",
        "&#xe61c": "num_8",
        "&#xe61d": "num_9",
        "&#xe61e": "num_7",
        "&#xe61f": "num_6",
    }
    
    
    # download the font file
    def get_font(url):
        response_woff = requests.get(url)
        filename = url.split('/')[-1]
        with open(filename, "wb") as f:
            f.write(response_woff.content)
        return filename
    
    
    def get_map_url(font_name):
        # load the font file into an object Python can work with
        base_font = TTFont(font_name)
        base_font.saveXML('font.xml')
        # build the entity -> digit mapping
        for key in two_2_sixteenn.keys():
            two_2_sixteenn[key] = real_2_map[two_2_sixteenn[key]]
        return two_2_sixteenn
    
    
    # clean up the extracted text
    def format_str(character):
        character = [x.replace(" ", '') for x in character]
        character = "".join(character)
        character = character.replace(";", '')
        return character
    
    
    if __name__ == '__main__':
        font_name = get_font(url)
        two_2_sixteenn = get_map_url(font_name)
        # fetch the profile page
        response = requests.get(
            'https://v.douyin.com/77hwgS/',
            headers=headers)
        with open('替换之前的.html', mode='w', encoding='utf-8') as f:
            f.write(response.text)
        new_html = response.text
        # replace the encoded glyphs in the html
        for key, value in two_2_sixteenn.items():
            new_html = new_html.replace(str(key).lower(), value)
        with open('替换之后的.html', mode='w', encoding='utf-8') as f:
            f.write(new_html)
        # extract the follower / following / like counts
        selector = Selector(new_html)
        fans_num = format_str(selector.xpath('//span[@class="follower block"]//span[@class="num"]//text()').getall())
        star_num = format_str(selector.xpath('//span[@class="focus block"]//span[@class="num"]//text()').getall())
        like_num = format_str(selector.xpath('//span[@class="liked-num block"]//span[@class="num"]//text()').getall())
        print('粉丝的数量:', fans_num)
        print('关注的数量:', star_num)
        print('赞的数量:', like_num)
    
    
    Result: the script prints the follower, following, and like counts as plain digits.

    github传送门:https://github.com/Key-lei/ZitiSpider/tree/master

  • Cracking font encryption, with 58同城 as the example. Font encryption is one of the more troublesome problems when scraping. It usually means the page has replaced the default character encoding and loads its own custom font file as the text style, so the digits display correctly, but in...
  • Font encryption analysis (字体加密分析)

    2020-08-30 16:36:40
    Font encryption has become a common anti-scraping measure. How do we crack it? Let's analyze it step by step. First, two important things need to be prepared...
  • An approach to cracking font encryption (破解字体加密解决思路)

    1,000+ reads 2019-07-25 10:11:09
    Principle: font encryption is really the substitution of a specific font library for the browser's own font library at display time. Take 58's encrypted font library as an example: on 58同城, whether it is the font encryption in résumés or the encryption in property listings, there is a trail to follow; as we know, the encrypted font...
  • A Python crawler example: font-encryption anti-scraping countermeasures for 2021

    1,000+ reads, many likes 2021-02-12 15:52:55
    A Python crawler example: the 2021 Maoyan box-office font-encryption anti-scraping strategy. Preface: the font encryption on the Maoyan box-office page is dynamic; the font file changes with every page load, or every day. This post analyzes that kind of encryption. The principle of font encryption: simply put, the developer...
  • 58同城 font encryption with multiple font files

    1,000+ reads 2021-04-08 10:53:57
    It took three days (mornings only; afternoons were taken up by other things), but I finally worked out 58同城's multi-font encryption (I am a beginner, so it took a bit longer). Anyone interested can message me to discuss, or add me on WeChat at 18300485357. I am not posting the code here; with font encryption the approach is what matters,...
  • 1. Page analysis. With a crawler, the first job is always to analyze the page. ... Request method: GET. Are extra headers needed for verification: no idea, it needs testing in code. 2. ... The digits inside were displayed like that, and only then did it click that this must be font encryption. Flipping back through the page source, in
  • Some sites font-encrypt part of the page data to deter scraping: it displays normally when the user browses, but the scraped page source is gibberish. Cause: the page defines a character set with font-face in the CSS and maps it to the displayed text via unicode; the browser loads the CSS...
  • For CSS font encryption I have so far met these situations: the mapping between glyph coordinate points and code points does not change across requests, e.g. 58同城 rental listings; the glyph coordinate points are not in fixed positions on each request, but the number of points per character stays the same, e.g. 猿人学...
  • It turns out some sites use font-encryption techniques. To deal with this I dug through a lot of material, but many of the methods online no longer work because the sites' anti-scraping has improved or the font-encryption rules have been updated, so I set off on the arduous journey of cracking font encryption myself. ...
  • 大众点评字体加密.py

    2021-09-22 17:17:45
    Suitable for learning about crawlers and for discussion
  • Recently I have seen several bloggers explaining how to crack encrypted fonts. The general approach is to analyze the page source, request and inspect the custom font, and then finish the job by scraping the data. That method is genuinely good, but for a novice like me who is not much of a crawler writer it is a bit beyond the syllabus. ...
  • An approach to 58's font encryption (58字体加密解决思路)

    2019-06-03 11:07:54
    Font encryption is really the substitution of a specific font library for the browser's own font library at display time. 58's encryption scheme: on 58同城, whether it is the font encryption in résumés or the encryption in property listings, there is a trail to follow; as we know, the number of encrypted glyphs is usually not very...
  • Cracking font encryption, with 58同城 as the example.

    10,000+ reads, hot discussion 2018-11-30 16:28:42
    Font encryption is one of the more troublesome problems when scraping. It usually means the page has replaced the default character encoding and loads its own custom font file as the text style, so the digits display correctly, but in the source the same binary values, without the custom font loaded,...
  • Dianping shop woff font encryption (part 1). URL: http://www.dianping.com/shanghai/ch10/g110 . The digits on the page also use font encryption, in the woff font format. The woff file name changes every day (it is dynamic), so the referenced file also has to be obtained from the HTML via the CSS...
  • Code for cracking Dianping's font encryption
  • Cracking Dianping's font encryption with a Python crawler

    1,000+ reads 2019-10-18 03:14:44
    Dianping's biggest challenge is its font encryption, which looks like this. Clearly Dianping would rather wrongly block ten thousand than let one slip through, so it simply uses a font of its own. But since the font is custom, the page has to load a font file. In Chrome, right-click, Inspect, go to...
  • This post discusses the font encryption on Dianping's search page; the code can also be found on GitHub. First, look at the encryption: the HTML source that comes back is gibberish. So the goal is to find the mapping between the "gibberish" and the digits and substitute it in. The question is, where does this mapping come from? It comes from...
  • [Intro] While scraping data we sometimes hit garbled fonts, which is really font encryption; this article is about defeating this kind of anti-scraping with font decryption. 1. Open 58同城 in a browser and go to Beijing rentals. 2. Click Inspect and locate the rent price: the price in the source is gibberish, but...
  • This imitates the novel-text encryption anti-scraping of the Qidian novel site; the details may differ, but the effect is about the same. PHP >= 5.6, node.js, composer, and a complete font (Microsoft YaHei, msyh.ttf, is used for testing here). Install a few npm plugins: npm install -g ttf2svg //...
  • Mainly about its font encryption. This is it; I suspect plenty of people have run into this kind of anti-scraping, but none as unpleasant as Meituan's. With some other sites that use this trick you can still find some pattern in the exported xml file, but with Meituan there is absolutely no...
  • 猿人学 challenge 13: dynamic CSS font encryption. 1. First open the browser developer tools; under XHR you can see the ajax data endpoint. The returned data has a woff key, which is the woff file in base64. After requesting the data, base64-decode the value of the woff key...
  • Font anti-scraping is a custom font-encryption mapping: the text on the page is rendered by calling a custom font file, so the text in the page is no longer text but the corresponding font codes, and copying or simple scraping cannot capture the encoded text content. 2. The font-viewing software Font...
  • With font encryption there will of course be a request for the font file, so go and find it. Right below the font data, conveniently, is the font file's address, which makes things easy again. Just match out the font address with a regular expression: font_url = re.findall(r"src: url\('(.*?)'\)", html)[1] # ...
