  • Python: downloading a biquge novel and saving it as a txt document

    2019-10-30 20:02:33

    I've been reading a novel lately, and every time I click "next chapter" I have to wait, and some pages carry ads, which got annoying, so I downloaded it as a txt file instead: no ads, no waiting.

    The code is as follows:

    # coding=utf-8
    import requests
    from bs4 import BeautifulSoup
    
    # Request headers (pretend to be a normal browser)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}
    
    # URL of the novel's table of contents
    url = "https://www.biquge.info/35_35517/"
    
    # Fetch and decode the catalog page
    response = requests.get(url, headers=headers)
    html = response.content.decode('utf-8')
    
    # Parse it with BeautifulSoup
    soup = BeautifulSoup(html, "html5lib")
    for soup0 in soup.find_all("dd"):
        # Each <dd> holds one chapter link; build its absolute URL
        url1 = "https://www.biquge.info/35_35517/" + soup0.find("a").get('href')
        title = soup0.find("a").text
        print(url1)
        # Retry until the chapter downloads successfully
        while True:
            try:
                response1 = requests.get(url1, headers=headers)
                html1 = response1.content.decode('utf-8')
                soup1 = BeautifulSoup(html1, "html5lib")
                text0 = soup1.find(id="content").text
                print(text0)
                # Append the chapter title and body to the output file
                with open("zuowangchangsheng.txt", "a", encoding='utf-8') as f:
                    f.write(title)
                    f.write(text0)
                break
            except Exception as e:
                print(e)
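
    The retry loop above hammers the server forever if a chapter page is permanently broken. A politer variant, sketched here with a bounded retry count and a delay between attempts (the fetch_chapter name and the limits are assumptions, not the author's code):

    import time
    import requests

    def fetch_chapter(url1, headers, max_retries=3, delay=1.0):
        """Fetch one chapter page, retrying a bounded number of times."""
        for attempt in range(max_retries):
            try:
                response = requests.get(url1, headers=headers, timeout=10)
                return response.content.decode('utf-8')
            except Exception as e:
                print(e)
                time.sleep(delay)  # be polite between retries
        return None  # give up after max_retries attempts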

    Result: each chapter URL and its text are printed as they download, and everything is appended to zuowangchangsheng.txt (the original post's screenshots are omitted).

  • Crawling a novel with Python regular expressions (environment: PyCharm 2017, Python 3.7)

    What is a crawler

    A web crawler, also called a web spider, is a bot that automatically browses the World Wide Web, typically in order to build a web index.

    Web search engines and similar sites use crawler software to update their own content or their indexes of other sites. A crawler can save the pages it visits so that a search engine can later build an index for users to search against.

    Crawling consumes resources on the target system, and many sites do not tacitly allow it, so when visiting large numbers of pages a crawler must plan its schedule and load, and be "polite". Public sites that do not wish to be crawled, or to have the crawler's operator identified, can opt out with mechanisms such as a robots.txt file, which can ask robots to index only part of the site, or none of it.
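
    As a concrete illustration (the bot name here is hypothetical), Python's standard library can check robots.txt before a crawl:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.biquge.info/robots.txt")
    rp.read()
    # Only crawl if the site's robots.txt allows our user agent on this path
    print(rp.can_fetch("MyNovelBot/1.0", "https://www.biquge.info/35_35517/"))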

    The web holds far too many pages for even the largest crawler to index completely, which is why in the early days of the web, before about 2000, search engines often returned few relevant results. Today's engines have improved greatly and can return high-quality results immediately.

    Crawlers can also validate hyperlinks and HTML, and are used for web scraping.

    Environment: PyCharm 2017, Python 3.7

    For a beginner, crawling a novel is the simplest possible application, and for someone with no grounding in the syntax, clear logic matters more than long stretches of code.

    The whole process breaks down into the following steps:

    1. Pick the crawl target (the web page, i.e. the front-end page)

    First be clear about how a crawler works: it extracts data from a page's HTML source. This walkthrough crawls the novel at http://www.92kshu.cc/69509/.

    2. Analyze the source and extract the data

    This step mainly uses Python regular expressions to select the data you want.

    title = re.findall(r'...', html)[0]   # the HTML pattern between the quotes was lost when the post was converted to text

    This statement uses the re library to filter text: find a fragment in the page source that uniquely marks what you want and match against it. If one pass cannot isolate the data, filter in stages -- in this example the full html is fetched first and then the dl block is extracted from it, purely to obtain each chapter's address and title.

    Matching is done with re.findall(r'...'), with (.*?) standing in for the positions to be captured.
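
    For instance, a minimal sketch (the HTML string here is made up) of pulling chapter links and titles out with a non-greedy group:

    import re

    html = '<dd><a href="1.html">Chapter 1</a></dd><dd><a href="2.html">Chapter 2</a></dd>'
    chapters = re.findall(r'<a href="(.*?)">(.*?)</a>', html)
    print(chapters)  # [('1.html', 'Chapter 1'), ('2.html', 'Chapter 2')]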

    Regular expression reference

    Pattern       Matches
    ^             The start of the string.
    $             The end of the string.
    .             Any character except a newline; when the re.DOTALL flag is set, any character including newlines.
    [...]         A set of characters, listed individually: [amk] matches 'a', 'm', or 'k'.
    [^...]        Any character not in the brackets: [^abc] matches anything except a, b, or c.
    re*           Zero or more of the preceding expression.
    re+           One or more of the preceding expression.
    re?           Zero or one of the preceding expression, non-greedy.
    re{n}         Exactly n of the preceding expression: o{2} cannot match the "o" in "Bob" but matches the two o's in "food".
    re{n,}        n or more of the preceding expression: o{2,} cannot match the "o" in "Bob" but matches all the o's in "foooood"; "o{1,}" is equivalent to "o+", and "o{0,}" to "o*".
    re{n,m}       n to m of the preceding expression, greedy.
    a|b           Either a or b.
    (re)          Groups the expression and remembers the matched text.
    (?imx)        Turns on the optional flags i, m, or x, affecting only the bracketed region.
    (?-imx)       Turns off the flags i, m, or x, affecting only the bracketed region.
    (?:re)        Like (...), but does not create a group.
    (?imx:re)     Applies the i, m, or x flags within the parentheses.
    (?-imx:re)    Removes the i, m, or x flags within the parentheses.
    (?#...)       A comment.
    (?=re)        Positive lookahead: succeeds if the contained expression matches at the current position, without consuming input; the rest of the pattern is then tried right there.
    (?!re)        Negative lookahead: the opposite; succeeds when the contained expression does not match at the current position.
    (?>re)        An independent (atomic) match, with no backtracking.
    \w            A word character: letters, digits, and underscore.
    \W            A non-word character.
    \s            Any whitespace character, equivalent to [ \t\n\r\f\v].
    \S            Any non-whitespace character.
    \d            Any digit, equivalent to [0-9].
    \D            Any non-digit.
    \A            The start of the string.
    \Z            The end of the string; if the string ends in a newline, only up to the character before that newline.
    \z            The end of the string.
    \G            The position where the last match finished.
    \b            A word boundary, i.e. the position between a word and whitespace: 'er\b' matches the 'er' in "never" but not in "verb".
    \B            A non-word boundary: 'er\B' matches the 'er' in "verb" but not in "never".
    \n, \t, etc.  A newline, a tab, and so on.
    \1...\9       The text matched by group n.
    \10           The text matched by group 10, if that group matched; otherwise an octal character escape.
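
    A quick, runnable sanity check of a few rows from the table:

    import re

    print(re.findall(r'o{2,}', 'Bob food foooood'))      # ['oo', 'ooooo']
    print(re.findall(r'er\b', 'never verb'))             # ['er'] -- only the one in "never"
    print(re.search(r'(\w+) \1', 'the the cat').group()) # 'the the', via backreference \1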

    3. Clean the data (with Python)

    replace('a','b') substitutes b for a, which does for a first pass of cleaning; MapReduce can also be used for cleaning at larger scale.

    4. Write it to a file

    fb = open('%s.txt' % title, 'w', encoding='utf-8')

    This creates the file and opens it for writing; %s is a placeholder that gets filled in by % title. Likewise, chapter_url = "http://www.92kshu.cc%s" % chapter_url

    joins the two strings; compared with +, %s formatting avoids building intermediate strings and so saves memory.

    fb.write(string) is the statement that writes to the file.
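
    A tiny sketch of those two idioms together (the title value here is made up):

    title = "example-novel"                                   # would come from the regex above
    fb = open('%s.txt' % title, 'w', encoding='utf-8')        # creates "example-novel.txt"
    chapter_url = "http://www.92kshu.cc%s" % "/69509/1.html"  # joins base URL and relative path
    fb.write("Chapter 1\n")
    fb.close()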

    Source code (the HTML tag patterns inside the regexes were stripped when the post was converted to plain text; the patterns below are plausible reconstructions, not the author's exact ones):

    # download web pages
    import requests
    import re

    url = 'http://www.92kshu.cc/69509/'
    response = requests.get(url)
    response.encoding = 'gbk'
    html = response.text
    title = re.findall(r'<h1>(.*?)</h1>', html)[0]   # reconstructed pattern
    fb = open('%s.txt' % title, 'w', encoding='utf-8')
    # Get each chapter's address and title from the catalog block
    dl = re.findall(r'<dl>.*?正文(.*?)</dl>', html, re.S)[0]   # reconstructed pattern
    print(dl)
    chapter_info_list = re.findall(r'<a href="(.*?)">(.*?)</a>', dl)   # reconstructed pattern
    for chapter_info in chapter_info_list:
        chapter_url, chapter_title = chapter_info
        chapter_url = "http://www.92kshu.cc%s" % chapter_url
        chapter_response = requests.get(chapter_url)
        chapter_response.encoding = 'gbk'
        chapter_html = chapter_response.text
        chapter_content = re.findall(r'<div class="content">(.*?)</div>', chapter_html, re.S)[0]   # reconstructed pattern
        # Strip the markup left inside the chapter body
        chapter_content = chapter_content.replace('<br/>', '')
        chapter_content = chapter_content.replace('&nbsp;', '')
        fb.write(chapter_title)
        fb.write(chapter_content)
        fb.write('\n')
        print(chapter_url)

    Original post: https://www.cnblogs.com/flw0322/p/12252246.html

  • A txt novel downloader in Java (Jsoup), walking chapter by chapter from https://www.biqiuge.com/book/5954/3816992.html


    package com.zhjy.txt;
    
    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.UnsupportedEncodingException;
    import java.util.regex.Matcher;
    
    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    /**
     * txt小说下载
     * 
     * @author Administrator
     *
     */
    public class MyTxt {
    
    	static String url = "https://www.biqiuge.com/book/5954/3816992.html";
    
    	public static void main(String[] args) {
    
    		createFile();
    		parse(url);
    
    	}
    
    	private static void parse(String serverString) {
    		System.out.println(serverString);
    		// Use Jsoup's built-in HTTP request mechanism:
    		Document document = null;
    		try {
    			Connection conn = Jsoup.connect(serverString).timeout(5000);
    			conn.header("User-Agent",
    					"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0");
    			document = conn.get();
    		} catch (IOException e) {
    			e.printStackTrace();
    		} catch (Exception e) {
    			e.printStackTrace();
    			// Retry by recursing; note this can overflow the stack if failures persist
    			parse(url);
    		}
    
    		// String string = document.toString();
    		// System.out.println("document:" + string);
    
    		// Alternatively, parse an HTML/XML string directly:
    		// document = (Document) Jsoup.parse(serverString);
    		if (document == null) {
    			parse(url);
    			return;
    		}
    
    		Elements title = document.select("div");// all <div> elements; find the chapter-title container (class="content")
    		for (Element item : title) {
    			String name = item.attr("class");
    			if (name.equals("content")) {
    				Elements h1 = item.select("h1");
    				String txt = h1.text();
    				getTxt(txt);
    				System.out.println(txt);
    			}
    		}
    
    		Elements div = document.select("div");// all <div> elements; find the chapter body (id="content")
    		for (Element item : div) {
    			// System.out.println("--------------------------");
    			// System.out.println(item);
    
    			String name = item.attr("id");
    			if (name.equals("content")) {
    				System.out.println(item.text().length());
    				String[] line = item.text().split(" ");
    				int n = line.length;
    				for (int i = 0; i < n; i++) {
    					getTxt(line[i]);
    				}
    			}
    		}
    
    		Elements div1 = document.select("div");// all <div> elements; find the pager holding the next-chapter link
    		for (Element item : div1) {
    			String name = item.attr("class");
    			if (name.equals("page_chapter")) {
    
    				Elements a = item.select("a");
    				for (Element item1 : a) {
    					String name1 = item1.text();
    					if (name1.equals("下一章")) { // "下一章" = next chapter
    						String href = item1.attr("href");
    						System.out.println(href);
    
    						if (!href.contains(".html")) {
    							endTxt();
    						} else {
    							url = "https://www.biqiuge.com" + href;
    							parse(url);
    
    						}
    					}
    				}
    			}
    		}
    	}
    
    	public static void getTxt(String msg) {
    		String t = msg;
    		Matcher matcher = Patterns.WEB_URL.matcher(msg);
    		if (matcher.find()) {
    			// System.out.println(matcher.group());
    			t = t.replace(matcher.group(), "");
    		}
    		saveTxt(t + "\r\n");
    	}
    
    	public static void endTxt() {
    		try {
    			writer.close();
    		} catch (IOException e) {
    			e.printStackTrace();
    		}
    	}
    
    	static BufferedWriter writer;
    
    	public static void createFile() {
    		File f = new File("G:\\txt\\2.txt");
    		FileOutputStream writerStream = null;
    		try {
    			writerStream = new FileOutputStream(f, true);
    			writer = new BufferedWriter(new OutputStreamWriter(writerStream,
    					"UTF-8"));
    		} catch (FileNotFoundException e) {
    			e.printStackTrace();
    		} catch (UnsupportedEncodingException e) {
    			e.printStackTrace();
    		} catch (IOException e) {
    			e.printStackTrace();
    		}
    	}
    
    	public static void saveTxt(String msg) {
    		try {
    			writer.write(msg);
    			writer.flush();
    		} catch (FileNotFoundException e) {
    			e.printStackTrace();
    		} catch (UnsupportedEncodingException e) {
    			e.printStackTrace();
    		} catch (IOException e) {
    			e.printStackTrace();
    		}
    	}
    }
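
    The Java version follows each "下一章" (next chapter) link by recursing into parse(), which can exhaust the call stack on a book with thousands of chapters. A minimal iterative sketch of the same crawl in Python -- the selectors are assumed to mirror the Java code, not verified against the site:

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.biqiuge.com/book/5954/3816992.html"
    with open("novel.txt", "w", encoding="utf-8") as out:
        while url:
            soup = BeautifulSoup(requests.get(url, timeout=5).content, "html.parser")
            body = soup.find(id="content")            # chapter body, as in the Java version
            if body:
                out.write(body.get_text("\n") + "\n")
            nxt = soup.find("a", string="下一章")      # the "next chapter" link
            url = ("https://www.biqiuge.com" + nxt["href"]) if nxt and nxt["href"].endswith(".html") else None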
    
    
    package com.zhjy.txt;
    
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class Patterns {
    	/**
    	 * Regular expression to match all IANA top-level domains. List accurate as
    	 * of 2011/07/18. List taken from:
    	 * http://data.iana.org/TLD/tlds-alpha-by-domain.txt This pattern is
    	 * auto-generated by frameworks/ex/common/tools/make-iana-tld-pattern.py
    	 *
    	 * @deprecated Due to the recent proliferation of gTLDs, this API is
    	 *             expected to become out-of-date very quickly. Therefore it is
    	 *             now deprecated.
    	 */
    	@Deprecated
    	public static final String TOP_LEVEL_DOMAIN_STR = "((aero|arpa|asia|a[cdefgilmnoqrstuwxz])"
    			+ "|(biz|b[abdefghijmnorstvwyz])"
    			+ "|(cat|com|coop|c[acdfghiklmnoruvxyz])"
    			+ "|d[ejkmoz]"
    			+ "|(edu|e[cegrstu])"
    			+ "|f[ijkmor]"
    			+ "|(gov|g[abdefghilmnpqrstuwy])"
    			+ "|h[kmnrtu]"
    			+ "|(info|int|i[delmnoqrst])"
    			+ "|(jobs|j[emop])"
    			+ "|k[eghimnprwyz]"
    			+ "|l[abcikrstuvy]"
    			+ "|(mil|mobi|museum|m[acdeghklmnopqrstuvwxyz])"
    			+ "|(name|net|n[acefgilopruz])"
    			+ "|(org|om)"
    			+ "|(pro|p[aefghklmnrstwy])"
    			+ "|qa"
    			+ "|r[eosuw]"
    			+ "|s[abcdeghijklmnortuvyz]"
    			+ "|(tel|travel|t[cdfghjklmnoprtvwz])"
    			+ "|u[agksyz]"
    			+ "|v[aceginu]"
    			+ "|w[fs]"
    			+ "|(\u03b4\u03bf\u03ba\u03b9\u03bc\u03ae|\u0438\u0441\u043f\u044b\u0442\u0430\u043d\u0438\u0435|\u0440\u0444|\u0441\u0440\u0431|\u05d8\u05e2\u05e1\u05d8|\u0622\u0632\u0645\u0627\u06cc\u0634\u06cc|\u0625\u062e\u062a\u0628\u0627\u0631|\u0627\u0644\u0627\u0631\u062f\u0646|\u0627\u0644\u062c\u0632\u0627\u0626\u0631|\u0627\u0644\u0633\u0639\u0648\u062f\u064a\u0629|\u0627\u0644\u0645\u063a\u0631\u0628|\u0627\u0645\u0627\u0631\u0627\u062a|\u0628\u06be\u0627\u0631\u062a|\u062a\u0648\u0646\u0633|\u0633\u0648\u0631\u064a\u0629|\u0641\u0644\u0633\u0637\u064a\u0646|\u0642\u0637\u0631|\u0645\u0635\u0631|\u092a\u0930\u0940\u0915\u094d\u0937\u093e|\u092d\u093e\u0930\u0924|\u09ad\u09be\u09b0\u09a4|\u0a2d\u0a3e\u0a30\u0a24|\u0aad\u0abe\u0ab0\u0aa4|\u0b87\u0ba8\u0bcd\u0ba4\u0bbf\u0baf\u0bbe|\u0b87\u0bb2\u0b99\u0bcd\u0b95\u0bc8|\u0b9a\u0bbf\u0b99\u0bcd\u0b95\u0baa\u0bcd\u0baa\u0bc2\u0bb0\u0bcd|\u0baa\u0bb0\u0bbf\u0b9f\u0bcd\u0b9a\u0bc8|\u0c2d\u0c3e\u0c30\u0c24\u0c4d|\u0dbd\u0d82\u0d9a\u0dcf|\u0e44\u0e17\u0e22|\u30c6\u30b9\u30c8|\u4e2d\u56fd|\u4e2d\u570b|\u53f0\u6e7e|\u53f0\u7063|\u65b0\u52a0\u5761|\u6d4b\u8bd5|\u6e2c\u8a66|\u9999\u6e2f|\ud14c\uc2a4\ud2b8|\ud55c\uad6d|xn\\-\\-0zwm56d|xn\\-\\-11b5bs3a9aj6g|xn\\-\\-3e0b707e|xn\\-\\-45brj9c|xn\\-\\-80akhbyknj4f|xn\\-\\-90a3ac|xn\\-\\-9t4b11yi5a|xn\\-\\-clchc0ea0b2g2a9gcd|xn\\-\\-deba0ad|xn\\-\\-fiqs8s|xn\\-\\-fiqz9s|xn\\-\\-fpcrj9c3d|xn\\-\\-fzc2c9e2c|xn\\-\\-g6w251d|xn\\-\\-gecrj9c|xn\\-\\-h2brj9c|xn\\-\\-hgbk6aj7f53bba|xn\\-\\-hlcj6aya9esc7a|xn\\-\\-j6w193g|xn\\-\\-jxalpdlp|xn\\-\\-kgbechtv|xn\\-\\-kprw13d|xn\\-\\-kpry57d|xn\\-\\-lgbbat1ad8j|xn\\-\\-mgbaam7a8h|xn\\-\\-mgbayh7gpa|xn\\-\\-mgbbh1a71e|xn\\-\\-mgbc0a9azcg|xn\\-\\-mgberp4a5d4ar|xn\\-\\-o3cw4h|xn\\-\\-ogbpf8fl|xn\\-\\-p1ai|xn\\-\\-pgbs0dh|xn\\-\\-s9brj9c|xn\\-\\-wgbh1c|xn\\-\\-wgbl6a|xn\\-\\-xkc2al3hye2a|xn\\-\\-xkc2dl3a5ee0h|xn\\-\\-yfro4i67o|xn\\-\\-ygbi2ammx|xn\\-\\-zckzah|xxx)"
    			+ "|y[et]" + "|z[amw])";
    
    	/**
    	 * Regular expression pattern to match all IANA top-level domains.
    	 * 
    	 * @deprecated This API is deprecated. See {@link #TOP_LEVEL_DOMAIN_STR}.
    	 */
    	@Deprecated
    	public static final Pattern TOP_LEVEL_DOMAIN = Pattern
    			.compile(TOP_LEVEL_DOMAIN_STR);
    
    	/**
    	 * Regular expression to match all IANA top-level domains for WEB_URL. List
    	 * accurate as of 2011/07/18. List taken from:
    	 * http://data.iana.org/TLD/tlds-alpha-by-domain.txt This pattern is
    	 * auto-generated by frameworks/ex/common/tools/make-iana-tld-pattern.py
    	 *
    	 * @deprecated This API is deprecated. See {@link #TOP_LEVEL_DOMAIN_STR}.
    	 */
    	@Deprecated
    	public static final String TOP_LEVEL_DOMAIN_STR_FOR_WEB_URL = "(?:"
    			+ "(?:aero|arpa|asia|a[cdefgilmnoqrstuwxz])"
    			+ "|(?:biz|b[abdefghijmnorstvwyz])"
    			+ "|(?:cat|com|coop|c[acdfghiklmnoruvxyz])"
    			+ "|d[ejkmoz]"
    			+ "|(?:edu|e[cegrstu])"
    			+ "|f[ijkmor]"
    			+ "|(?:gov|g[abdefghilmnpqrstuwy])"
    			+ "|h[kmnrtu]"
    			+ "|(?:info|int|i[delmnoqrst])"
    			+ "|(?:jobs|j[emop])"
    			+ "|k[eghimnprwyz]"
    			+ "|l[abcikrstuvy]"
    			+ "|(?:mil|mobi|museum|m[acdeghklmnopqrstuvwxyz])"
    			+ "|(?:name|net|n[acefgilopruz])"
    			+ "|(?:org|om)"
    			+ "|(?:pro|p[aefghklmnrstwy])"
    			+ "|qa"
    			+ "|r[eosuw]"
    			+ "|s[abcdeghijklmnortuvyz]"
    			+ "|(?:tel|travel|t[cdfghjklmnoprtvwz])"
    			+ "|u[agksyz]"
    			+ "|v[aceginu]"
    			+ "|w[fs]"
    			+ "|(?:\u03b4\u03bf\u03ba\u03b9\u03bc\u03ae|\u0438\u0441\u043f\u044b\u0442\u0430\u043d\u0438\u0435|\u0440\u0444|\u0441\u0440\u0431|\u05d8\u05e2\u05e1\u05d8|\u0622\u0632\u0645\u0627\u06cc\u0634\u06cc|\u0625\u062e\u062a\u0628\u0627\u0631|\u0627\u0644\u0627\u0631\u062f\u0646|\u0627\u0644\u062c\u0632\u0627\u0626\u0631|\u0627\u0644\u0633\u0639\u0648\u062f\u064a\u0629|\u0627\u0644\u0645\u063a\u0631\u0628|\u0627\u0645\u0627\u0631\u0627\u062a|\u0628\u06be\u0627\u0631\u062a|\u062a\u0648\u0646\u0633|\u0633\u0648\u0631\u064a\u0629|\u0641\u0644\u0633\u0637\u064a\u0646|\u0642\u0637\u0631|\u0645\u0635\u0631|\u092a\u0930\u0940\u0915\u094d\u0937\u093e|\u092d\u093e\u0930\u0924|\u09ad\u09be\u09b0\u09a4|\u0a2d\u0a3e\u0a30\u0a24|\u0aad\u0abe\u0ab0\u0aa4|\u0b87\u0ba8\u0bcd\u0ba4\u0bbf\u0baf\u0bbe|\u0b87\u0bb2\u0b99\u0bcd\u0b95\u0bc8|\u0b9a\u0bbf\u0b99\u0bcd\u0b95\u0baa\u0bcd\u0baa\u0bc2\u0bb0\u0bcd|\u0baa\u0bb0\u0bbf\u0b9f\u0bcd\u0b9a\u0bc8|\u0c2d\u0c3e\u0c30\u0c24\u0c4d|\u0dbd\u0d82\u0d9a\u0dcf|\u0e44\u0e17\u0e22|\u30c6\u30b9\u30c8|\u4e2d\u56fd|\u4e2d\u570b|\u53f0\u6e7e|\u53f0\u7063|\u65b0\u52a0\u5761|\u6d4b\u8bd5|\u6e2c\u8a66|\u9999\u6e2f|\ud14c\uc2a4\ud2b8|\ud55c\uad6d|xn\\-\\-0zwm56d|xn\\-\\-11b5bs3a9aj6g|xn\\-\\-3e0b707e|xn\\-\\-45brj9c|xn\\-\\-80akhbyknj4f|xn\\-\\-90a3ac|xn\\-\\-9t4b11yi5a|xn\\-\\-clchc0ea0b2g2a9gcd|xn\\-\\-deba0ad|xn\\-\\-fiqs8s|xn\\-\\-fiqz9s|xn\\-\\-fpcrj9c3d|xn\\-\\-fzc2c9e2c|xn\\-\\-g6w251d|xn\\-\\-gecrj9c|xn\\-\\-h2brj9c|xn\\-\\-hgbk6aj7f53bba|xn\\-\\-hlcj6aya9esc7a|xn\\-\\-j6w193g|xn\\-\\-jxalpdlp|xn\\-\\-kgbechtv|xn\\-\\-kprw13d|xn\\-\\-kpry57d|xn\\-\\-lgbbat1ad8j|xn\\-\\-mgbaam7a8h|xn\\-\\-mgbayh7gpa|xn\\-\\-mgbbh1a71e|xn\\-\\-mgbc0a9azcg|xn\\-\\-mgberp4a5d4ar|xn\\-\\-o3cw4h|xn\\-\\-ogbpf8fl|xn\\-\\-p1ai|xn\\-\\-pgbs0dh|xn\\-\\-s9brj9c|xn\\-\\-wgbh1c|xn\\-\\-wgbl6a|xn\\-\\-xkc2al3hye2a|xn\\-\\-xkc2dl3a5ee0h|xn\\-\\-yfro4i67o|xn\\-\\-ygbi2ammx|xn\\-\\-zckzah|xxx)"
    			+ "|y[et]" + "|z[amw]))";
    
    	/**
    	 * Regular expression to match all IANA top-level domains.
    	 *
    	 * List accurate as of 2015/11/24. List taken from:
    	 * http://data.iana.org/TLD/tlds-alpha-by-domain.txt This pattern is
    	 * auto-generated by frameworks/ex/common/tools/make-iana-tld-pattern.py
    	 *
    	 * @hide
    	 */
    	static final String IANA_TOP_LEVEL_DOMAINS = "(?:"
    			+ "(?:aaa|aarp|abb|abbott|abogado|academy|accenture|accountant|accountants|aco|active"
    			+ "|actor|ads|adult|aeg|aero|afl|agency|aig|airforce|airtel|allfinanz|alsace|amica|amsterdam"
    			+ "|android|apartments|app|apple|aquarelle|aramco|archi|army|arpa|arte|asia|associates"
    			+ "|attorney|auction|audio|auto|autos|axa|azure|a[cdefgilmoqrstuwxz])"
    			+ "|(?:band|bank|bar|barcelona|barclaycard|barclays|bargains|bauhaus|bayern|bbc|bbva"
    			+ "|bcn|beats|beer|bentley|berlin|best|bet|bharti|bible|bid|bike|bing|bingo|bio|biz|black"
    			+ "|blackfriday|bloomberg|blue|bms|bmw|bnl|bnpparibas|boats|bom|bond|boo|boots|boutique"
    			+ "|bradesco|bridgestone|broadway|broker|brother|brussels|budapest|build|builders|business"
    			+ "|buzz|bzh|b[abdefghijmnorstvwyz])"
    			+ "|(?:cab|cafe|cal|camera|camp|cancerresearch|canon|capetown|capital|car|caravan|cards"
    			+ "|care|career|careers|cars|cartier|casa|cash|casino|cat|catering|cba|cbn|ceb|center|ceo"
    			+ "|cern|cfa|cfd|chanel|channel|chat|cheap|chloe|christmas|chrome|church|cipriani|cisco"
    			+ "|citic|city|cityeats|claims|cleaning|click|clinic|clothing|cloud|club|clubmed|coach"
    			+ "|codes|coffee|college|cologne|com|commbank|community|company|computer|comsec|condos"
    			+ "|construction|consulting|contractors|cooking|cool|coop|corsica|country|coupons|courses"
    			+ "|credit|creditcard|creditunion|cricket|crown|crs|cruises|csc|cuisinella|cymru|cyou|c[acdfghiklmnoruvwxyz])"
    			+ "|(?:dabur|dad|dance|date|dating|datsun|day|dclk|deals|degree|delivery|dell|delta"
    			+ "|democrat|dental|dentist|desi|design|dev|diamonds|diet|digital|direct|directory|discount"
    			+ "|dnp|docs|dog|doha|domains|doosan|download|drive|durban|dvag|d[ejkmoz])"
    			+ "|(?:earth|eat|edu|education|email|emerck|energy|engineer|engineering|enterprises"
    			+ "|epson|equipment|erni|esq|estate|eurovision|eus|events|everbank|exchange|expert|exposed"
    			+ "|express|e[cegrstu])"
    			+ "|(?:fage|fail|fairwinds|faith|family|fan|fans|farm|fashion|feedback|ferrero|film"
    			+ "|final|finance|financial|firmdale|fish|fishing|fit|fitness|flights|florist|flowers|flsmidth"
    			+ "|fly|foo|football|forex|forsale|forum|foundation|frl|frogans|fund|furniture|futbol|fyi"
    			+ "|f[ijkmor])"
    			+ "|(?:gal|gallery|game|garden|gbiz|gdn|gea|gent|genting|ggee|gift|gifts|gives|giving"
    			+ "|glass|gle|global|globo|gmail|gmo|gmx|gold|goldpoint|golf|goo|goog|google|gop|gov|grainger"
    			+ "|graphics|gratis|green|gripe|group|gucci|guge|guide|guitars|guru|g[abdefghilmnpqrstuwy])"
    			+ "|(?:hamburg|hangout|haus|healthcare|help|here|hermes|hiphop|hitachi|hiv|hockey|holdings"
    			+ "|holiday|homedepot|homes|honda|horse|host|hosting|hoteles|hotmail|house|how|hsbc|hyundai"
    			+ "|h[kmnrtu])"
    			+ "|(?:ibm|icbc|ice|icu|ifm|iinet|immo|immobilien|industries|infiniti|info|ing|ink|institute"
    			+ "|insure|int|international|investments|ipiranga|irish|ist|istanbul|itau|iwc|i[delmnoqrst])"
    			+ "|(?:jaguar|java|jcb|jetzt|jewelry|jlc|jll|jobs|joburg|jprs|juegos|j[emop])"
    			+ "|(?:kaufen|kddi|kia|kim|kinder|kitchen|kiwi|koeln|komatsu|krd|kred|kyoto|k[eghimnprwyz])"
    			+ "|(?:lacaixa|lancaster|land|landrover|lasalle|lat|latrobe|law|lawyer|lds|lease|leclerc"
    			+ "|legal|lexus|lgbt|liaison|lidl|life|lifestyle|lighting|limited|limo|linde|link|live"
    			+ "|lixil|loan|loans|lol|london|lotte|lotto|love|ltd|ltda|lupin|luxe|luxury|l[abcikrstuvy])"
    			+ "|(?:madrid|maif|maison|man|management|mango|market|marketing|markets|marriott|mba"
    			+ "|media|meet|melbourne|meme|memorial|men|menu|meo|miami|microsoft|mil|mini|mma|mobi|moda"
    			+ "|moe|moi|mom|monash|money|montblanc|mormon|mortgage|moscow|motorcycles|mov|movie|movistar"
    			+ "|mtn|mtpc|mtr|museum|mutuelle|m[acdeghklmnopqrstuvwxyz])"
    			+ "|(?:nadex|nagoya|name|navy|nec|net|netbank|network|neustar|new|news|nexus|ngo|nhk"
    			+ "|nico|ninja|nissan|nokia|nra|nrw|ntt|nyc|n[acefgilopruz])"
    			+ "|(?:obi|office|okinawa|omega|one|ong|onl|online|ooo|oracle|orange|org|organic|osaka"
    			+ "|otsuka|ovh|om)"
    			+ "|(?:page|panerai|paris|partners|parts|party|pet|pharmacy|philips|photo|photography"
    			+ "|photos|physio|piaget|pics|pictet|pictures|ping|pink|pizza|place|play|playstation|plumbing"
    			+ "|plus|pohl|poker|porn|post|praxi|press|pro|prod|productions|prof|properties|property"
    			+ "|protection|pub|p[aefghklmnrstwy])"
    			+ "|(?:qpon|quebec|qa)"
    			+ "|(?:racing|realtor|realty|recipes|red|redstone|rehab|reise|reisen|reit|ren|rent|rentals"
    			+ "|repair|report|republican|rest|restaurant|review|reviews|rich|ricoh|rio|rip|rocher|rocks"
    			+ "|rodeo|rsvp|ruhr|run|rwe|ryukyu|r[eosuw])"
    			+ "|(?:saarland|sakura|sale|samsung|sandvik|sandvikcoromant|sanofi|sap|sapo|sarl|saxo"
    			+ "|sbs|sca|scb|schmidt|scholarships|school|schule|schwarz|science|scor|scot|seat|security"
    			+ "|seek|sener|services|seven|sew|sex|sexy|shiksha|shoes|show|shriram|singles|site|ski"
    			+ "|sky|skype|sncf|soccer|social|software|sohu|solar|solutions|sony|soy|space|spiegel|spreadbetting"
    			+ "|srl|stada|starhub|statoil|stc|stcgroup|stockholm|studio|study|style|sucks|supplies"
    			+ "|supply|support|surf|surgery|suzuki|swatch|swiss|sydney|systems|s[abcdeghijklmnortuvxyz])"
    			+ "|(?:tab|taipei|tatamotors|tatar|tattoo|tax|taxi|team|tech|technology|tel|telefonica"
    			+ "|temasek|tennis|thd|theater|theatre|tickets|tienda|tips|tires|tirol|today|tokyo|tools"
    			+ "|top|toray|toshiba|tours|town|toyota|toys|trade|trading|training|travel|trust|tui|t[cdfghjklmnortvwz])"
    			+ "|(?:ubs|university|uno|uol|u[agksyz])"
    			+ "|(?:vacations|vana|vegas|ventures|versicherung|vet|viajes|video|villas|vin|virgin"
    			+ "|vision|vista|vistaprint|viva|vlaanderen|vodka|vote|voting|voto|voyage|v[aceginu])"
    			+ "|(?:wales|walter|wang|watch|webcam|website|wed|wedding|weir|whoswho|wien|wiki|williamhill"
    			+ "|win|windows|wine|wme|work|works|world|wtc|wtf|w[fs])"
    			+ "|(?:\u03b5\u03bb|\u0431\u0435\u043b|\u0434\u0435\u0442\u0438|\u043a\u043e\u043c|\u043c\u043a\u0434"
    			+ "|\u043c\u043e\u043d|\u043c\u043e\u0441\u043a\u0432\u0430|\u043e\u043d\u043b\u0430\u0439\u043d"
    			+ "|\u043e\u0440\u0433|\u0440\u0443\u0441|\u0440\u0444|\u0441\u0430\u0439\u0442|\u0441\u0440\u0431"
    			+ "|\u0443\u043a\u0440|\u049b\u0430\u0437|\u0570\u0561\u0575|\u05e7\u05d5\u05dd|\u0627\u0631\u0627\u0645\u0643\u0648"
    			+ "|\u0627\u0644\u0627\u0631\u062f\u0646|\u0627\u0644\u062c\u0632\u0627\u0626\u0631|\u0627\u0644\u0633\u0639\u0648\u062f\u064a\u0629"
    			+ "|\u0627\u0644\u0645\u063a\u0631\u0628|\u0627\u0645\u0627\u0631\u0627\u062a|\u0627\u06cc\u0631\u0627\u0646"
    			+ "|\u0628\u0627\u0632\u0627\u0631|\u0628\u06be\u0627\u0631\u062a|\u062a\u0648\u0646\u0633"
    			+ "|\u0633\u0648\u062f\u0627\u0646|\u0633\u0648\u0631\u064a\u0629|\u0634\u0628\u0643\u0629"
    			+ "|\u0639\u0631\u0627\u0642|\u0639\u0645\u0627\u0646|\u0641\u0644\u0633\u0637\u064a\u0646"
    			+ "|\u0642\u0637\u0631|\u0643\u0648\u0645|\u0645\u0635\u0631|\u0645\u0644\u064a\u0633\u064a\u0627"
    			+ "|\u0645\u0648\u0642\u0639|\u0915\u0949\u092e|\u0928\u0947\u091f|\u092d\u093e\u0930\u0924"
    			+ "|\u0938\u0902\u0917\u0920\u0928|\u09ad\u09be\u09b0\u09a4|\u0a2d\u0a3e\u0a30\u0a24|\u0aad\u0abe\u0ab0\u0aa4"
    			+ "|\u0b87\u0ba8\u0bcd\u0ba4\u0bbf\u0baf\u0bbe|\u0b87\u0bb2\u0b99\u0bcd\u0b95\u0bc8|\u0b9a\u0bbf\u0b99\u0bcd\u0b95\u0baa\u0bcd\u0baa\u0bc2\u0bb0\u0bcd"
    			+ "|\u0c2d\u0c3e\u0c30\u0c24\u0c4d|\u0dbd\u0d82\u0d9a\u0dcf|\u0e04\u0e2d\u0e21|\u0e44\u0e17\u0e22"
    			+ "|\u10d2\u10d4|\u307f\u3093\u306a|\u30b0\u30fc\u30b0\u30eb|\u30b3\u30e0|\u4e16\u754c"
    			+ "|\u4e2d\u4fe1|\u4e2d\u56fd|\u4e2d\u570b|\u4e2d\u6587\u7f51|\u4f01\u4e1a|\u4f5b\u5c71"
    			+ "|\u4fe1\u606f|\u5065\u5eb7|\u516b\u5366|\u516c\u53f8|\u516c\u76ca|\u53f0\u6e7e|\u53f0\u7063"
    			+ "|\u5546\u57ce|\u5546\u5e97|\u5546\u6807|\u5728\u7ebf|\u5927\u62ff|\u5a31\u4e50|\u5de5\u884c"
    			+ "|\u5e7f\u4e1c|\u6148\u5584|\u6211\u7231\u4f60|\u624b\u673a|\u653f\u52a1|\u653f\u5e9c"
    			+ "|\u65b0\u52a0\u5761|\u65b0\u95fb|\u65f6\u5c1a|\u673a\u6784|\u6de1\u9a6c\u9521|\u6e38\u620f"
    			+ "|\u70b9\u770b|\u79fb\u52a8|\u7ec4\u7ec7\u673a\u6784|\u7f51\u5740|\u7f51\u5e97|\u7f51\u7edc"
    			+ "|\u8c37\u6b4c|\u96c6\u56e2|\u98de\u5229\u6d66|\u9910\u5385|\u9999\u6e2f|\ub2f7\ub137"
    			+ "|\ub2f7\ucef4|\uc0bc\uc131|\ud55c\uad6d|xbox"
    			+ "|xerox|xin|xn\\-\\-11b4c3d|xn\\-\\-1qqw23a|xn\\-\\-30rr7y|xn\\-\\-3bst00m|xn\\-\\-3ds443g"
    			+ "|xn\\-\\-3e0b707e|xn\\-\\-3pxu8k|xn\\-\\-42c2d9a|xn\\-\\-45brj9c|xn\\-\\-45q11c|xn\\-\\-4gbrim"
    			+ "|xn\\-\\-55qw42g|xn\\-\\-55qx5d|xn\\-\\-6frz82g|xn\\-\\-6qq986b3xl|xn\\-\\-80adxhks"
    			+ "|xn\\-\\-80ao21a|xn\\-\\-80asehdb|xn\\-\\-80aswg|xn\\-\\-90a3ac|xn\\-\\-90ais|xn\\-\\-9dbq2a"
    			+ "|xn\\-\\-9et52u|xn\\-\\-b4w605ferd|xn\\-\\-c1avg|xn\\-\\-c2br7g|xn\\-\\-cg4bki|xn\\-\\-clchc0ea0b2g2a9gcd"
    			+ "|xn\\-\\-czr694b|xn\\-\\-czrs0t|xn\\-\\-czru2d|xn\\-\\-d1acj3b|xn\\-\\-d1alf|xn\\-\\-efvy88h"
    			+ "|xn\\-\\-estv75g|xn\\-\\-fhbei|xn\\-\\-fiq228c5hs|xn\\-\\-fiq64b|xn\\-\\-fiqs8s|xn\\-\\-fiqz9s"
    			+ "|xn\\-\\-fjq720a|xn\\-\\-flw351e|xn\\-\\-fpcrj9c3d|xn\\-\\-fzc2c9e2c|xn\\-\\-gecrj9c"
    			+ "|xn\\-\\-h2brj9c|xn\\-\\-hxt814e|xn\\-\\-i1b6b1a6a2e|xn\\-\\-imr513n|xn\\-\\-io0a7i"
    			+ "|xn\\-\\-j1aef|xn\\-\\-j1amh|xn\\-\\-j6w193g|xn\\-\\-kcrx77d1x4a|xn\\-\\-kprw13d|xn\\-\\-kpry57d"
    			+ "|xn\\-\\-kput3i|xn\\-\\-l1acc|xn\\-\\-lgbbat1ad8j|xn\\-\\-mgb9awbf|xn\\-\\-mgba3a3ejt"
    			+ "|xn\\-\\-mgba3a4f16a|xn\\-\\-mgbaam7a8h|xn\\-\\-mgbab2bd|xn\\-\\-mgbayh7gpa|xn\\-\\-mgbbh1a71e"
    			+ "|xn\\-\\-mgbc0a9azcg|xn\\-\\-mgberp4a5d4ar|xn\\-\\-mgbpl2fh|xn\\-\\-mgbtx2b|xn\\-\\-mgbx4cd0ab"
    			+ "|xn\\-\\-mk1bu44c|xn\\-\\-mxtq1m|xn\\-\\-ngbc5azd|xn\\-\\-node|xn\\-\\-nqv7f|xn\\-\\-nqv7fs00ema"
    			+ "|xn\\-\\-nyqy26a|xn\\-\\-o3cw4h|xn\\-\\-ogbpf8fl|xn\\-\\-p1acf|xn\\-\\-p1ai|xn\\-\\-pgbs0dh"
    			+ "|xn\\-\\-pssy2u|xn\\-\\-q9jyb4c|xn\\-\\-qcka1pmc|xn\\-\\-qxam|xn\\-\\-rhqv96g|xn\\-\\-s9brj9c"
    			+ "|xn\\-\\-ses554g|xn\\-\\-t60b56a|xn\\-\\-tckwe|xn\\-\\-unup4y|xn\\-\\-vermgensberater\\-ctb"
    			+ "|xn\\-\\-vermgensberatung\\-pwb|xn\\-\\-vhquv|xn\\-\\-vuq861b|xn\\-\\-wgbh1c|xn\\-\\-wgbl6a"
    			+ "|xn\\-\\-xhq521b|xn\\-\\-xkc2al3hye2a|xn\\-\\-xkc2dl3a5ee0h|xn\\-\\-y9a3aq|xn\\-\\-yfro4i67o"
    			+ "|xn\\-\\-ygbi2ammx|xn\\-\\-zfr164b|xperia|xxx|xyz)"
    			+ "|(?:yachts|yamaxun|yandex|yodobashi|yoga|yokohama|youtube|y[et])"
    			+ "|(?:zara|zip|zone|zuerich|z[amw]))";
    
    	/**
    	 * Kept for backward compatibility reasons.
    	 *
    	 * @deprecated Deprecated since it does not include all IRI characters
    	 *             defined in RFC 3987
    	 */
    	@Deprecated
    	public static final String GOOD_IRI_CHAR = "a-zA-Z0-9\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF";
    
    	public static final Pattern IP_ADDRESS = Pattern
    			.compile("((25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\\.(25[0-5]|2[0-4]"
    					+ "[0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1]"
    					+ "[0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}"
    					+ "|[1-9][0-9]|[0-9]))");
    
    	/**
    	 * Valid UCS characters defined in RFC 3987. Excludes space characters.
    	 */
    	private static final String UCS_CHAR = "[" + "\u00A0-\uD7FF"
    			+ "\uF900-\uFDCF" + "\uFDF0-\uFFEF" + "\uD800\uDC00-\uD83F\uDFFD"
    			+ "\uD840\uDC00-\uD87F\uDFFD" + "\uD880\uDC00-\uD8BF\uDFFD"
    			+ "\uD8C0\uDC00-\uD8FF\uDFFD" + "\uD900\uDC00-\uD93F\uDFFD"
    			+ "\uD940\uDC00-\uD97F\uDFFD" + "\uD980\uDC00-\uD9BF\uDFFD"
    			+ "\uD9C0\uDC00-\uD9FF\uDFFD" + "\uDA00\uDC00-\uDA3F\uDFFD"
    			+ "\uDA40\uDC00-\uDA7F\uDFFD" + "\uDA80\uDC00-\uDABF\uDFFD"
    			+ "\uDAC0\uDC00-\uDAFF\uDFFD" + "\uDB00\uDC00-\uDB3F\uDFFD"
    			+ "\uDB44\uDC00-\uDB7F\uDFFD"
    			+ "&&[^\u00A0[\u2000-\u200A]\u2028\u2029\u202F\u3000]]";
    
    	/**
    	 * Valid characters for IRI label defined in RFC 3987.
    	 */
    	private static final String LABEL_CHAR = "a-zA-Z0-9" + UCS_CHAR;
    
    	/**
    	 * Valid characters for IRI TLD defined in RFC 3987.
    	 */
    	private static final String TLD_CHAR = "a-zA-Z" + UCS_CHAR;
    
    	/**
    	 * RFC 1035 Section 2.3.4 limits the labels to a maximum 63 octets.
    	 */
    	private static final String IRI_LABEL = "[" + LABEL_CHAR + "](?:["
    			+ LABEL_CHAR + "\\-]{0,61}[" + LABEL_CHAR + "]){0,1}";
    
    	/**
    	 * RFC 3492 references RFC 1034 and limits Punycode algorithm output to 63
    	 * characters.
    	 */
    	private static final String PUNYCODE_TLD = "xn\\-\\-[\\w\\-]{0,58}\\w";
    
    	private static final String TLD = "(" + PUNYCODE_TLD + "|" + "[" + TLD_CHAR
    			+ "]{2,63}" + ")";
    
    	private static final String HOST_NAME = "(" + IRI_LABEL + "\\.)+" + TLD;
    
    	public static final Pattern DOMAIN_NAME = Pattern.compile("(" + HOST_NAME
    			+ "|" + IP_ADDRESS + ")");
    
    	private static final String PROTOCOL = "(?i:http|https|rtsp):\\/\\/";
    
    	/*
    	 * A word boundary or end of input. This is to stop foo.sure from matching
    	 * as foo.su
    	 */
    	private static final String WORD_BOUNDARY = "(?:\\b|$|^)";
    
    	private static final String USER_INFO = "(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)"
    			+ "\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_"
    			+ "\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@";
    
    	private static final String PORT_NUMBER = "\\:\\d{1,5}";
    
    	private static final String PATH_AND_QUERY = "\\/(?:(?:[" + LABEL_CHAR
    			+ "\\;\\/\\?\\:\\@\\&\\=\\#\\~" // plus optional query params
    			+ "\\-\\.\\+\\!\\*\\'\\(\\)\\,\\_])|(?:\\%[a-fA-F0-9]{2}))*";
    
    	/**
    	 * Regular expression pattern to match most part of RFC 3987
    	 * Internationalized URLs, aka IRIs.
    	 */
    	public static final Pattern WEB_URL = Pattern.compile("(" + "(" + "(?:"
    			+ PROTOCOL + "(?:" + USER_INFO + ")?" + ")?" + "(?:" + DOMAIN_NAME
    			+ ")" + "(?:" + PORT_NUMBER + ")?" + ")" + "(" + PATH_AND_QUERY
    			+ ")?" + WORD_BOUNDARY + ")");
    
    	/**
    	 * Regular expression that matches known TLDs and punycode TLDs
    	 */
    	private static final String STRICT_TLD = "(?:" + IANA_TOP_LEVEL_DOMAINS
    			+ "|" + PUNYCODE_TLD + ")";
    
    	/**
    	 * Regular expression that matches host names using {@link #STRICT_TLD}
    	 */
    	private static final String STRICT_HOST_NAME = "(?:(?:" + IRI_LABEL
    			+ "\\.)+" + STRICT_TLD + ")";
    
    	/**
    	 * Regular expression that matches domain names using either
    	 * {@link #STRICT_HOST_NAME} or {@link #IP_ADDRESS}
    	 */
    	private static final Pattern STRICT_DOMAIN_NAME = Pattern.compile("(?:"
    			+ STRICT_HOST_NAME + "|" + IP_ADDRESS + ")");
    
    	/**
    	 * Regular expression that matches domain names without a TLD
    	 */
    	private static final String RELAXED_DOMAIN_NAME = "(?:" + "(?:" + IRI_LABEL
    			+ "(?:\\.(?=\\S))" + "?)+" + "|" + IP_ADDRESS + ")";
    
    	/**
    	 * Regular expression to match strings that do not start with a supported
    	 * protocol. The TLDs are expected to be one of the known TLDs.
    	 */
    	private static final String WEB_URL_WITHOUT_PROTOCOL = "(" + WORD_BOUNDARY
    			+ "(?<!:\\/\\/)" + "(" + "(?:" + STRICT_DOMAIN_NAME + ")" + "(?:"
    			+ PORT_NUMBER + ")?" + ")" + "(?:" + PATH_AND_QUERY + ")?"
    			+ WORD_BOUNDARY + ")";
    
    	/**
    	 * Regular expression to match strings that start with a supported protocol.
    	 * Rules for domain names and TLDs are more relaxed. TLDs are optional.
    	 */
    	private static final String WEB_URL_WITH_PROTOCOL = "(" + WORD_BOUNDARY
    			+ "(?:" + "(?:" + PROTOCOL + "(?:" + USER_INFO + ")?" + ")" + "(?:"
    			+ RELAXED_DOMAIN_NAME + ")?" + "(?:" + PORT_NUMBER + ")?" + ")"
    			+ "(?:" + PATH_AND_QUERY + ")?" + WORD_BOUNDARY + ")";
    
    	/**
    	 * Regular expression pattern to match IRIs. If a string starts with
    	 * http(s):// the expression tries to match the URL structure with a relaxed
    	 * rule for TLDs. If the string does not start with http(s):// the TLDs are
    	 * expected to be one of the known TLDs.
    	 *
    	 * @hide
    	 */
    	public static final Pattern AUTOLINK_WEB_URL = Pattern.compile("("
    			+ WEB_URL_WITH_PROTOCOL + "|" + WEB_URL_WITHOUT_PROTOCOL + ")");
    
    	/**
    	 * Regular expression for valid email characters. Does not include some of
    	 * the valid characters defined in RFC5321: #&~!^`{}/=$*?|
    	 */
    	private static final String EMAIL_CHAR = LABEL_CHAR + "\\+\\-_%'";
    
    	/**
    	 * Regular expression for local part of an email address. RFC5321 section
    	 * 4.5.3.1.1 limits the local part to be at most 64 octets.
    	 */
    	private static final String EMAIL_ADDRESS_LOCAL_PART = "[" + EMAIL_CHAR
    			+ "]" + "(?:[" + EMAIL_CHAR + "\\.]{1,62}[" + EMAIL_CHAR + "])?";
    
    	/**
    	 * Regular expression for the domain part of an email address. RFC5321
    	 * section 4.5.3.1.2 limits the domain to be at most 255 octets.
    	 */
    	private static final String EMAIL_ADDRESS_DOMAIN = "(?=.{1,255}(?:\\s|$|^))"
    			+ HOST_NAME;
    
    	/**
    	 * Regular expression pattern to match email addresses. It excludes double
    	 * quoted local parts and the special characters #&~!^`{}/=$*?| that are
    	 * included in RFC5321.
    	 * 
    	 * @hide
    	 */
    	public static final Pattern AUTOLINK_EMAIL_ADDRESS = Pattern.compile("("
    			+ WORD_BOUNDARY + "(?:" + EMAIL_ADDRESS_LOCAL_PART + "@"
    			+ EMAIL_ADDRESS_DOMAIN + ")" + WORD_BOUNDARY + ")");
    
    	public static final Pattern EMAIL_ADDRESS = Pattern
    			.compile("[a-zA-Z0-9\\+\\.\\_\\%\\-\\+]{1,256}" + "\\@"
    					+ "[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}" + "(" + "\\."
    					+ "[a-zA-Z0-9][a-zA-Z0-9\\-]{0,25}" + ")+");
    
    	/**
    	 * This pattern is intended for searching for things that look like they
    	 * might be phone numbers in arbitrary text, not for validating whether
    	 * something is in fact a phone number. It will miss many things that are
    	 * legitimate phone numbers.
    	 *
    	 * <p>
    	 * The pattern matches the following:
    	 * <ul>
    	 * <li>Optionally, a + sign followed immediately by one or more digits.
    	 * Spaces, dots, or dashes may follow.
    	 * <li>Optionally, sets of digits in parentheses, separated by spaces, dots,
    	 * or dashes.
    	 * <li>A string starting and ending with a digit, containing digits, spaces,
    	 * dots, and/or dashes.
    	 * </ul>
    	 */
    	public static final Pattern PHONE = Pattern.compile( // sdd = space, dot, or
    															// dash
    			"(\\+[0-9]+[\\- \\.]*)?" // +<digits><sdd>*
    					+ "(\\([0-9]+\\)[\\- \\.]*)?" // (<digits>)<sdd>*
    					+ "([0-9][0-9\\- \\.]+[0-9])"); // <digit><digit|sdd>+<digit>
    
    	/**
    	 * Convenience method to take all of the non-null matching groups in a regex
    	 * Matcher and return them as a concatenated string.
    	 *
    	 * @param matcher
    	 *            The Matcher object from which grouped text will be extracted
    	 *
    	 * @return A String comprising all of the non-null matched groups
    	 *         concatenated together
    	 */
    	public static final String concatGroups(Matcher matcher) {
    		StringBuilder b = new StringBuilder();
    		final int numGroups = matcher.groupCount();
    
    		for (int i = 1; i <= numGroups; i++) {
    			String s = matcher.group(i);
    
    			if (s != null) {
    				b.append(s);
    			}
    		}
    
    		return b.toString();
    	}
    
    	/**
    	 * Convenience method to return only the digits and plus signs in the
    	 * matching string.
    	 *
    	 * @param matcher
    	 *            The Matcher object from which digits and plus will be
    	 *            extracted
    	 *
    	 * @return A String comprising all of the digits and plus in the match
    	 */
    	public static final String digitsAndPlusOnly(Matcher matcher) {
    		StringBuilder buffer = new StringBuilder();
    		String matchingRegion = matcher.group();
    
    		for (int i = 0, size = matchingRegion.length(); i < size; i++) {
    			char character = matchingRegion.charAt(i);
    
    			if (character == '+' || Character.isDigit(character)) {
    				buffer.append(character);
    			}
    		}
    		return buffer.toString();
    	}
    
    	/**
    	 * Do not create this static utility class.
    	 */
    	private Patterns() {
    	}
    }
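
    Patterns.WEB_URL above is Android's heavyweight IRI matcher; MyTxt only uses it to scrub URL watermarks out of chapter lines. A far simpler Python sketch of that scrubbing step (this crude pattern is deliberately not a port of WEB_URL):

    import re

    # Deliberately simplified URL pattern -- nowhere near as thorough as Patterns.WEB_URL
    URL_RE = re.compile(r'https?://\S+|www\.\S+')

    def strip_urls(line):
        """Remove URL-like watermarks from a scraped line of text."""
        return URL_RE.sub('', line)

    print(strip_urls("chapter text www.biqiuge.com chapter text"))  # -> "chapter text  chapter text"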
    
    
  • Downloading a single chapter with urllib.request and BeautifulSoup, and writing the result to 爬虫.txt

    from urllib import request
    from bs4 import BeautifulSoup
    import webbrowser

    if __name__ == "__main__":
      url= 'http://www.biqukan.com/47_47404/17679470.html'
      webbrowser.open(url)
      head = {}
      head['User-Agent'] = 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'
      requ = request.Request(url, headers = head)
      download_response = request.urlopen(requ)
      download_html = download_response.read().decode('gbk','ignore')
      soup_texts = BeautifulSoup(download_html, 'lxml')
      texts = soup_texts.find_all(id = 'content', class_ = 'showtxt')
      soup_text = BeautifulSoup(str(texts), 'lxml')

      # Remove the \xa0 characters that cannot be decoded
      download_text = soup_text.div.text.replace('\xa0','')

      # Write the scraped text to a document (explicit utf-8 so the platform's default codec can't choke)
      with open("爬虫.txt", "w", encoding="utf-8") as filename:
        filename.write(download_text)

    Reposted from: https://www.cnblogs.com/kimi765096/p/8667197.html

  • Xiaoxiong Buwen (小熊捕文) is a very practical document download manager. Its biggest selling point is that, with nothing but a network connection, it can quickly gather popular resources such as novels and news, download them for free, and also merge selected material from plain text, Word, or PPT files into a single page...
  • Download one novel you like best and store it as a text document; it should be a long novel of 200,000+ characters. Pick any 10 characters in it, taking into account their names, aliases, and other such factors. 1. Count how many times each character appears in the novel and sort the results. 2. Measure how much of the text each character occupies... (see the sketch below)
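
    That counting exercise fits in a few lines with collections.Counter; a minimal sketch, assuming a hypothetical novel.txt and hand-picked aliases:

    from collections import Counter

    # Hypothetical: each character mapped to every name/alias used for them
    aliases = {"张三": ["张三", "三郎"], "李四": ["李四"]}

    with open("novel.txt", encoding="utf-8") as f:
        text = f.read()

    # Sum occurrences of all aliases per character, then sort descending
    counts = Counter({who: sum(text.count(a) for a in names)
                      for who, names in aliases.items()})
    for who, n in counts.most_common():
        print(who, n)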
  • This little program crawls novels from novel sites; run-of-the-mill pirate novel sites are very easy to crawl because they have essentially no anti-crawler measures, so they can be scraped directly. It uses http://www.126shu.com/15/ (the novel 全职法师) as its example. 2. The requests library, documentation:...
  • Novel download reader

    2014-03-26 15:54:12
    A novel reader/downloader with fairly complete features, suitable for beginners; documentation is included, saving time and effort.
  • Txt小说下载工具 v1.4.1

    2011-11-26 10:46:23
    Automatically collects web-novel resources and downloads novels directly as TXT documents.
  • [Download: Python下载网络小说实例代码.txt] (tip: right-click the txt file name above -> Save target as.) Sample code for downloading web novels with Python. I usually let web novels pile up and then load them onto a Kindle to read, but once enough accumulate, the mechanical Ctrl+C and Ctrl+V...
  • Python: downloading a novel and saving it locally

    2017-05-11 14:36:12
    Download a novel and save it locally: import bs4, os, requests; a list holds each chapter's content, and the script downloads the novel from the site chapter by chapter and saves it into a txt document... (preview truncated)
  • Downloading a novel with requests

    2017-04-04 13:32:00
    Using requests ... 1.1. The official quick-start guide (latest version; mind the version number when pinning a release) 1.2. Opening a URL 1.3. Sending other requests 1.3.1. Send a POST request with requests.post(url) 1.3.2. Similarly, whichever kind of request you need to send...
  • Xiaoxiao Novel Download Reader (小小小说下载阅读器) is a free reader for txt novels -- and of course for any txt document. Highlights: 1. Full txt support, with bookmarks you can add and delete and automatic memory of the reading position. 2. Multi-document reading. 3. One-click hide and a fully minimal mode...
  • A biquge (笔趣阁) novel downloader -- a practice project with no real technical depth: find the table-of-contents page of the novel site, download every chapter from the addresses it lists, and put each chapter into its own text document.
  • A downloader that can read documents in many formats and organize them conveniently.
  • [How to download Baidu Wenku documents for free] (2011-05-28 17:50:30): for example, to download Han Han's novel 《三重门》, type "site:wenku.baidu.co..." into Baidu...
  • Python program: novel downloader

    2021-04-22 14:08:55
    A so-called novel downloader simply takes a novel that some site offers for online reading, uses a crawler to fetch the page's html, and then uses regular...
  • Python: crawling and downloading novels

    2018-02-07 21:50:40
    I love novels and have a long reading history. Having finished the mainstream books, I am now reading one called 《仙魔同修》 by 流浪, but another, better-known book shares its title, so most download links online point to that one instead. I wanted to download mine to read, but couldn't find it online...
  • Now there is the Longlong (龙龙) TXT novel download reader, which handles large txt documents smoothly, so you no longer have to endure that wretched Notepad while reading novels. It offers a humanized reading experience: novel bookmarks, chapter-by-chapter reading, automatic progress memory, a novel table of contents...
  • Aliyun Drive (阿里云盘) is the official network drive built by the Alibaba Cloud team, with huge storage supporting documents, pictures, audio, video, and more; no matter how much you store, the phone never lags, and files can be grouped by time, file type, and other criteria, making search very convenient...
  • 1. Install the node environment: use npm -v to check whether the system has nodejs; if not, go to the node site (http://nodejs.cn/), pick your operating system, and install. npm -v → 6.14.6. 2. Install gitbook globally with npm install gitbook-cli -g...
  • How do you strip the ads and surplus blank lines from a downloaded TXT novel? ... Open the TXT document of a novel downloaded from the web and find the ad text; ads like this usually sit at the start or end of every chapter, with identical content each time. 2. In the TXT editor's menu bar, click "Edit"... (a code sketch follows below)
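
    In code form, that cleanup is a line filter; a minimal sketch with placeholder file names and ad text:

    # Placeholder: the repeated ad string found at the start/end of each chapter
    ad = "some-ad-text"
    with open("novel.txt", encoding="utf-8") as f:
        lines = [ln for ln in f if ln.strip() and ad not in ln]  # drop blank and ad lines
    with open("novel_clean.txt", "w", encoding="utf-8") as f:
        f.writelines(lines)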
  • BeautifulSoup takes its name from the well-known foreign classic Alice's Adventures in Wonderland. As the name suggests, the library's author wanted its users to feel relaxed and comfortable, with a pleasant, convenient interface...
  • A web-based novel reading and download system: visitors can browse the site's home page, but must register as members before reading or borrowing e-books for free. Members log in with an account and password, and each account carries personal information such as an email address and postal address...
  • A quick note: the target URL is https://www.qqxs.cc/fenlei1/1/, downloading novel cover images. The example below may serve as a reference. 1. The normal object-oriented approach: determine the url ==> fetch the response ==> extract the image addresses from the response with xpath ==>...
  • Having recently gotten into novels, I downloaded 《神墓》, only to find the whole text in a single 2 MB text document that took a noticeably long time just to open, so I wrote the little program below, which extracts the novel's content chapter by chapter... (see the sketch below)
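
    A minimal sketch of that chapter-splitting idea (the file names and the chapter-heading pattern are assumptions):

    import re

    with open("shenmu.txt", encoding="utf-8") as f:   # hypothetical input file
        text = f.read()

    # Split on headings like "第123章 ..." and write each chapter to its own file
    parts = re.split(r'(第[0-9一二三四五六七八九十百千]+章[^\n]*)', text)
    for i in range(1, len(parts) - 1, 2):
        heading, body = parts[i], parts[i + 1]
        with open("%03d.txt" % (i // 2 + 1), "w", encoding="utf-8") as out:
            out.write(heading + "\n" + body)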
  • Novel download. Tools: Python 3.6 + Requests + BeautifulSoup4 (click Requests or BeautifulSoup for the corresponding Chinese documentation). Task: download a novel with a Python crawler; the site crawled this time is http://www.kbiquge.com/. Analysis...
  • Kanjiu (看久) e-book download reader (TxtReader, with text-to-speech for novels) naturally simulates the way a person reads a book, giving you a convenient platform for reading novels. It can also read aloud to you automatically, supporting Chinese, English, Japanese, and various dialects -- you can listen to novels as well as read them...
  • Download the full document. Introduction: technology advances by the day, and plots from the science fiction we read as children are becoming reality: self-driving cars are entering our lives, and touchscreen laptops that fit in a pocket have appeared. Virtual reality is spreading ever more widely -- in fact...
  • Online document search tool Soutunwang (搜囤王) V1.3

    2011-03-22 11:35:44
    Novel search, novel download, online novels, free novels, free documents, document download, document search, online documents, plus categories such as construction plans, requirements documents, design outlines, management documents, office documents, telecom material, power-industry material, paper downloads, professional literature, practical writing, literary works, and leisure...
