  • Scraping organization information from the 51job recruitment site with scrapy and saving it to an xls workbook
  • Scraping 51job with Scrapy

    2021-07-21 15:57:09

    Contents

    I. Introduction

    II. Steps

        1. Analyzing the page

            ① The job listings on 51job are embedded in a <script> tag as JSON

            ② URL parsing

            ③ URL concatenation and pagination

        2. Crawler project code

            settings.py

            pipeline.py

            job.py

        3. Creating the MySQL table

        4. Running the program

        5. Data cleaning with Spark

        6. Visualization

            1. Work experience vs. salary

            2. Education requirements by position

            Results for requirements 1 and 2

            3. Company-benefits word cloud

            4. Distribution of big-data jobs by city

            Results for requirements 3 and 4

    III. Summary


    I. Introduction

            ① Scrape job postings from the 51job (前程无忧) site with Scrapy
            ② Store the data in MySQL
            ③ Clean the data with Spark
            ④ Visualize it with Flask + ECharts

    II. Steps

        A detailed introduction to Scrapy: https://blog.csdn.net/ck784101777/article/details/104468780?

        1. Analyzing the page

            ① The job listings on 51job are embedded in a <script></script> tag as JSON (a standalone sketch of the extraction follows this list)

            ② URL parsing

            ③ URL concatenation and pagination
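
    Before wiring this into Scrapy, the idea can be checked in isolation. The sketch below is not the project's code: it assumes the search page still embeds its results in a window.__SEARCH_RESULT__ JavaScript variable inside a <script> tag (the same variable name appears in the other examples on this page) and that a plain requests call is not blocked; the headers and the helper name are illustrative.

    import json
    import re

    import requests


    def fetch_jobs(keyword: str, page: int = 1):
        """Return the engine_search_result list embedded in one 51job search page."""
        url = ("https://search.51job.com/list/000000,000000,0000,00,9,99,"
               f"{keyword},2,{page}.html")
        headers = {"User-Agent": "Mozilla/5.0"}
        html = requests.get(url, headers=headers).text
        # The listings are a JSON object assigned to window.__SEARCH_RESULT__ inside a <script> tag.
        match = re.search(r"window\.__SEARCH_RESULT__\s*=\s*(\{.*?\})</script>", html, re.S)
        if not match:
            return []
        return json.loads(match.group(1)).get("engine_search_result", [])


    if __name__ == "__main__":
        for job in fetch_jobs("python")[:5]:
            print(job["job_name"], job.get("providesalary_text", ""))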

    class JobSpider(scrapy.Spider):
        name = 'job'
        allowed_domains = ['search.51job.com']
        # start_urls = ['http://search.51job.com/']
        job_name = input("Enter a job keyword: ")
    
        # URL-encoded keyword plus the fixed ",2," segment of the search URL
        keyword = urllib.parse.quote(job_name) + ",2,"
    
        # Pages
        # start_urls = f'https://search.51job.com/list/000000,000000,0000,00,9,99,{keyword},2,1.html?'
        start_urls = []
        # Generate the paginated URLs
        for i in range(1, 101):
            url_pre = 'https://search.51job.com/list/000000,000000,0000,00,9,99,'
            url_end = '.html?'
            url = url_pre + keyword + str(i) + url_end  # concatenate the URL
            start_urls.append(url)  # add it to start_urls

    2. Crawler project code

    # Create the Scrapy project
    
    scrapy startproject job51
    
    # Create the spider inside it
    
    cd job51
    
    scrapy genspider job search.51job.com

         Create a run script at the same level as scrapy.cfg:

    from scrapy import cmdline
    
    cmdline.execute("scrapy crawl job".split())

            settings.py

    
    # Bot name
    BOT_NAME = 'job51'
    
    SPIDER_MODULES = ['job51.spiders']
    NEWSPIDER_MODULE = 'job51.spiders'
    # Log level
    LOG_LEVEL = "WARNING"
    # robots.txt protocol
    ROBOTSTXT_OBEY = False
    
    # Concurrency
    CONCURRENT_REQUESTS = 100
    CONCURRENT_REQUESTS_PER_DOMAIN = 100
    CONCURRENT_REQUESTS_PER_IP = 100
    
    # Request headers
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    
    # Enable the item pipeline
    ITEM_PIPELINES = {
        'job51.pipelines.Job51Pipeline': 300,
    }

            pipeline.py

    import pymysql
    # logging for error output
    import logging
    
    logger = logging.getLogger(__name__)
    
    # Pipeline: write each item into MySQL
    class Job51Pipeline:
        def process_item(self, item, spider):
            
            # Connect to the MySQL database
            connect = pymysql.connect(host='localhost', user='root', password='123456', db='spark', port=3306,
                                      charset="utf8")
            cursor = connect.cursor()
    
            # Insert the item (parameterized so quotes in the data do not break the SQL)
            try:
                cursor.execute(
                    "insert into scrapy_job VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
                    (item["job_id"], item["job_name"], item["job_price"], item["job_url"], item["job_time"],
                     item["job_place"], item["job_edu"], item["job_exp"], item["job_well"], item["company_name"],
                     item["company_type"], item["company__mag"], item["company_genre"]))
                connect.commit()
            except Exception as error:
                # Log the error and keep crawling
                logger.warning(error)
    
            # Close the connection
    
            cursor.close()
            connect.close()
    
            return item
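
    A side note rather than the article's code: the same pipeline could open the MySQL connection once per crawl instead of once per item by using open_spider/close_spider. The credentials, table and field names below are the ones used above; everything else is an assumption.

    import logging

    import pymysql

    logger = logging.getLogger(__name__)

    INSERT_SQL = "insert into scrapy_job VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    FIELDS = ["job_id", "job_name", "job_price", "job_url", "job_time", "job_place", "job_edu",
              "job_exp", "job_well", "company_name", "company_type", "company__mag", "company_genre"]


    class Job51Pipeline:
        def open_spider(self, spider):
            # one connection for the whole crawl
            self.connect = pymysql.connect(host='localhost', user='root', password='123456',
                                           db='spark', port=3306, charset="utf8")
            self.cursor = self.connect.cursor()

        def process_item(self, item, spider):
            try:
                self.cursor.execute(INSERT_SQL, [item[f] for f in FIELDS])
                self.connect.commit()
            except Exception as error:
                logger.warning(error)
            return item

        def close_spider(self, spider):
            self.cursor.close()
            self.connect.close()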

            job.py

       

    import json
    
    import scrapy
    import urllib.parse
    
    
    class JobSpider(scrapy.Spider):
        name = 'job'
        allowed_domains = ['search.51job.com']
        # start_urls = ['http://search.51job.com/']
        job_name = input("Enter a job keyword: ")
    
        # URL-encoded keyword plus the fixed ",2," segment of the search URL
        keyword = urllib.parse.quote(job_name) + ",2,"
    
        # Pages
        # start_urls = f'https://search.51job.com/list/000000,000000,0000,00,9,99,{keyword},2,1.html?'
        start_urls = []
        # Generate the paginated URLs
        for i in range(1, 351):
            url_pre = 'https://search.51job.com/list/000000,000000,0000,00,9,99,'
            url_end = '.html?'
            url = url_pre + keyword + str(i) + url_end  # concatenate the URL
            start_urls.append(url)  # add it to start_urls
    
        def parse(self, response):
            # with open("job.html", "w", encoding="gbk") as f:
            #     f.write(response.text)
    
            # The results sit in a <script> tag as JSON assigned to window.__SEARCH_RESULT__;
            # slice off the leading assignment prefix and parse the rest.
            req2 = response.xpath("/html/body/script/text()").extract_first()[29:]
    
            # with open("job.json", "w", encoding="utf-8") as f:
            #     f.write(req2)
            selectors = json.loads(req2)["engine_search_result"]
    
            for selector in selectors:
                # job_id, job_name, job_price, job_url, job_time, job_place, job_edu, job_exp,
                # job_well, company_name, company_type, company_mag, company_genre
                item = {}
                item["job_id"] = selector["jobid"]
                item["job_name"] = selector["job_name"]
    
                if selector["providesalary_text"]:
                    item["job_price"] = selector["providesalary_text"]
                else:
                    item["job_price"] = ''
    
                item["job_url"] = selector["job_href"]
                item["job_time"] = selector["issuedate"]
    
                item["job_place"] = selector["workarea_text"]
    
                if len(selector["attribute_text"]) > 3:
                    item["job_edu"] = selector["attribute_text"][2]
                else:
                    item["job_edu"] = "无要求"
                item["job_exp"] = selector["attribute_text"][1]
    
                if selector["jobwelf"]:
                    item["job_well"] = selector["jobwelf"]
                else:
                    item["job_well"] = " 无福利 "
                item["company_name"] = selector["company_name"]
                item["company_type"] = selector["companytype_text"]
                if selector["companysize_text"]:
                    item["company__mag"] = selector["companysize_text"]
                else:
                    item["company__mag"] = ""
                item["company_genre"] = selector["companyind_text"]
    
                yield item

    3. Creating the MySQL table

    create table scrapy_job(
    job_id varchar(100), job_name varchar(200), job_price varchar(100), job_url varchar(100), job_time varchar(100), 
    job_place varchar(100), job_edu varchar(100), job_exp varchar(100), job_well varchar(200), company_name varchar(100), 
    company_type varchar(100), company_mag varchar(100), company_genre varchar(100)
    )

    4. Running the program

     Check the data in the database
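
    A quick way to confirm the crawl actually landed rows; this is just a sanity-check sketch, assuming the same local MySQL credentials as in pipeline.py:

    import pymysql

    connect = pymysql.connect(host='localhost', user='root', password='123456',
                              db='spark', port=3306, charset="utf8")
    with connect.cursor() as cursor:
        # row count plus a small sample of the crawled data
        cursor.execute("SELECT COUNT(*) FROM scrapy_job")
        print("rows crawled:", cursor.fetchone()[0])
        cursor.execute("SELECT job_name, job_price, job_place FROM scrapy_job LIMIT 5")
        for row in cursor.fetchall():
            print(row)
    connect.close()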

    5. Data cleaning with Spark 

         ① job_exp: take the largest number, e.g. "5-7年经验" -> 7; "无需经验" (no experience required) and empty values become 0
         ② job_price: take the average salary, e.g. "1-1.5万/月" -> 12500; empty values become 0 (a small Python sketch of this rule follows the list)
         ③ job_time: keep only the date, e.g. 2021-04-19 13:31:43 -> 2021-04-19; empty values become "无发布时间" (no publish date)
         ④ company_type: strip parentheses and their contents
         ⑤ company_mag: take the largest number, e.g. "150-500人" -> 500; empty values become 0
         ⑥ company_genre: strip parentheses and their contents
         ⑦ Deduplicate
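
    A small Python illustration of rule ②; the regexes mirror the Scala patterns used further below, but this is not part of the project:

    import re

    SALARY_PATTERNS = [
        (re.compile(r"^(\d+\.?\d*)-(\d+\.?\d*)万/年$"), 10000 / 12),  # 万/年 -> average per month
        (re.compile(r"^(\d+\.?\d*)-(\d+\.?\d*)万/月$"), 10000),        # 万/月
        (re.compile(r"^(\d+\.?\d*)-(\d+\.?\d*)千/月$"), 1000),         # 千/月
    ]


    def average_monthly_salary(text):
        """Turn a salary range like '1-1.5万/月' into an average monthly figure; 0 for empty/unknown."""
        if not text:
            return 0.0
        for pattern, unit in SALARY_PATTERNS:
            m = pattern.match(text)
            if m:
                low, high = float(m.group(1)), float(m.group(2))
                return (low + high) / 2 * unit
        return 0.0


    assert average_monthly_salary("1-1.5万/月") == 12500
    assert average_monthly_salary("") == 0.0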

    pom.xml

        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.11</artifactId>
          <version>2.1.1</version>
        </dependency>
    
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-sql_2.11</artifactId>
          <version>2.1.1</version>
        </dependency>
    
        <dependency>
          <groupId>mysql</groupId>
          <artifactId>mysql-connector-java</artifactId>
          <version>8.0.21</version>
        </dependency>
    
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-streaming_2.11</artifactId>
          <version>2.1.1</version>
        </dependency>
    
        <dependency>
          <groupId>com.alibaba</groupId>
          <artifactId>fastjson</artifactId>
          <version>1.2.66</version>
        </dependency>
    
        <dependency>
          <groupId>com.fasterxml.jackson.core</groupId>
          <artifactId>jackson-core</artifactId>
          <version>2.10.1</version>
        </dependency>
      

     In MySQL, make a copy of the crawl-stage table's structure, named scrapy_job_copy1, to hold the cleaned data (one way to do this from Python is sketched below).
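
    A minimal sketch of creating that empty copy, assuming the same local MySQL instance and credentials as earlier (CREATE TABLE ... LIKE copies only the table structure, not the rows):

    import pymysql

    connect = pymysql.connect(host='localhost', user='root', password='123456',
                              db='spark', port=3306, charset="utf8")
    with connect.cursor() as cursor:
        # copy the structure of scrapy_job into scrapy_job_copy1
        cursor.execute("CREATE TABLE IF NOT EXISTS scrapy_job_copy1 LIKE scrapy_job")
    connect.commit()
    connect.close()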

    Code

    package job
    
    import org.apache.spark.sql.{SaveMode, SparkSession}
    
    import java.text.SimpleDateFormat
    import java.util.{Date, Properties}
    import java.util.regex.{Matcher, Pattern}
    
    object HomeWork1 {
      def main(args: Array[String]): Unit = {
        /**
         * job_exp: take the largest number, e.g. "5-7年经验" -> 7; "无需经验" and nulls become 0
         * job_price: take the average salary, e.g. "1-1.5万/月" -> 12500; nulls become 0
         * job_time: keep only the date, 2021-04-19 13:31:43 -> 2021-04-19; nulls become "无发布时间"
         * company_type: strip parentheses and their contents
         * company_mag: take the largest number, e.g. "150-500人" -> 500; nulls become 0
         * company_genre: strip parentheses and their contents
         * deduplicate
         */
        // experience: pull all digits out of strings like "5-7年经验"
        val regex = """\d+""".r
        val pattern = Pattern.compile("\\d+")
    
        //salary range patterns
        val salary1 = Pattern.compile("""^(\d+?\.?\d*?)\-(\d+?\.?\d*?)万/年$""")
        val salary2 = Pattern.compile("""^(\d+?\.?\d*?)\-(\d+?\.?\d*?)万/月$""")
        val salary3 = Pattern.compile("""^(\d+?\.?\d*?)\-(\d+?\.?\d*?)千/月$""")

        //date formats
        val format: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
        val toformat: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")

        //strip parentheses (this pattern ends up unused; the replaceAll calls below do the actual stripping)
        val company = Pattern.compile("""(.*?)""")

        //mysql connection properties
        val properties = new Properties()
        properties.setProperty("user","root")
        properties.setProperty("password","123456")
    
        val session = SparkSession.builder().master("local[1]").appName("te").getOrCreate()
        //val rdd = session.read.option("delimiter", ",").option("header", "true").csv("a/job.csv").distinct().rdd
    
        val rdd = session.read.jdbc("jdbc:mysql://localhost:3306/spark?serverTimezone=UTC", "scrapy_job", properties).rdd
        val value = rdd.map({
          item => {

            //TODO job_exp: take the largest number, e.g. "5-7年经验" -> 7; "无需经验" and nulls become 0
            var jobexp = "0"
            if (item.isNullAt(7) || item.equals("") || ((!item.isNullAt(7)) & item.getAs[String](7).equals("无需经验"))) {
              jobexp = "0"
            } else {
              val iterator = regex.findAllIn(item(7).toString)

              iterator.foreach {
                x => {
                  if (x.toInt > jobexp.toInt) {
                    jobexp = x
                  }
                }
              }
            } // close the else branch here so the remaining fields are computed for every row

            // TODO job_price: take the average salary, e.g. "1-1.5万/月" -> 12500; nulls become 0
            var jobprice = 0.0
    
            if(item.isNullAt(2)){
              jobprice=0.0
            }else {
              val matcher1 = salary1.matcher(item.getAs(2)) // 万/年
              val matcher2 = salary2.matcher(item.getAs(2)) // 万/月
              val matcher3 = salary3.matcher(item.getAs(2)) // 千/月
              if(matcher1.find()){
    
    
                jobprice =(matcher1.group(1).toDouble + matcher1.group(2).toDouble) * 10000 / 24
                //print("万/年",jobprice)
              }else if(matcher2.find()){
    
                jobprice = (matcher2.group(1).toDouble +matcher2.group(2).toDouble) * 10000 / 2
                //print("万/月",jobprice)
              }else if(matcher3.find()){
    
                jobprice = (matcher3.group(1).toDouble + matcher3.group(2).toDouble) * 1000
                //print("千/月",jobprice)
              }
            }
    
            //TODO job_time: keep only the date, 2021-04-19 13:31:43 -> 2021-04-19; nulls become "无发布时间"
    
            var jobtime = "无发布时间"
            if(!item.isNullAt(4)){
              jobtime = toformat.format(format.parse(item.getAs[String](4)))
    
            }
    
            //TODO company_type: strip parentheses (ASCII and full-width) and their contents
            var companytype = ""
            if(!item.isNullAt(10)){
              companytype = item.getAs(10).toString.replaceAll("\\(.*?\\)|（.*?）","")
    
            }
    
            //TODO company_mag: take the largest number, e.g. "150-500人" -> 500; nulls become 0
            var companymag = 0
            if (!item.isNullAt(11)){
              val matcher2 = pattern.matcher(item.getAs(11))
              while (matcher2.find()) {
                if (matcher2.group().toInt > companymag.toInt) {
                  companymag = matcher2.group().toInt
                }
              }
            }
    
            //TODO company_genre: strip parentheses and their contents
    
            var companygenre = ""
            if (!item.isNullAt(12)){
              companygenre = item.getString(12).replaceAll("\\(.*?\\)","")
    
            }
    
            Res(item(0).toString,item(1).toString,jobprice.toInt.toString,item(3).toString,jobtime,item(5).toString,item(6).toString,jobexp,item(8).toString,item(9).toString,companytype,companymag.toString,companygenre)
    
          }
        })
    
        import session.implicits._
        //write the cleaned data to MySQL, overwriting scrapy_job_copy1
        value.toDF().write.mode(SaveMode.Overwrite).jdbc("jdbc:mysql://localhost:3306/spark?serverTimezone=UTC", "scrapy_job_copy1", properties)
    
      }
    
      case class Res(job_id: String, job_name: String, job_price: String, job_url: String, job_time: String, job_place: String, job_edu: String, job_exp: String, job_well: String, company_name: String, company_type: String, company_mag: String, company_genre: String)
    
    }
    

    6. Visualization

    1. Create visual.py

    2. Project structure diagram

    1. Work experience vs. salary

            Requirements 1 and 2 are visualized with ECharts.

            Requirements 3 and 4 use pyecharts.

            Because the cleaning step already reduced job_exp to its largest value, the experience-handling part of the cleaning code was commented out for this chart.

            

           

    2. Education requirements by position

     Visualization code for requirements 1 and 2

    import json
    
    from flask import Flask, render_template
    import pymysql
    
    app = Flask(__name__)
    
    
    def con_mysql(sql):
        # connect
        con = pymysql.connect(host='localhost', user='root', password='123456', database='spark', charset='utf8')
        # cursor
        cursor = con.cursor()
        # run the sql
        cursor.execute(sql)
        # fetch the result set
        result = cursor.fetchall()
        return result
    
    
    def extract_all(result):
        nianFen = []
        chanLiang = []
        for i in result:
            nianFen.append(i[0])
            chanLiang.append(float(i[1]))
        return (nianFen, chanLiang)
    
    
    # Work experience vs. average salary
    @app.route("/")
    def hello_world():
        # sql
        sql = "SELECT job_exp,TRUNCATE(avg(job_price),2) as salary FROM scrapy_job_copy1 GROUP BY job_exp having job_exp like'%经验' ORDER BY salary asc"
        result = con_mysql(sql)
        list = extract_all(result)
        print(list)
        return render_template("req2-2.html", nianFen=list[0], chanLiang=list[1])
    
    
    # Education requirement counts
    @app.route('/2')
    def get_data():
        sql = "select count(job_edu) as cnt from scrapy_job_copy1 GROUP BY job_edu ORDER BY cnt asc"
        result = con_mysql(sql)
        cnt = []
        for i in result:
            cnt.append(i[0])
        print(result)
        return render_template("req3.html", cnt=cnt)
    
    
    # @app.route("/1")
    # def hello_world():
    #     # sql
    #     sql = "select job_edu,count(job_edu)as cnt from scrapy_job_copy1 GROUP BY job_edu ORDER BY cnt asc"
    #     result = con_mysql(sql)
    #     list = extract_all(result)
    #     print(list)
    #     return render_template("req2.html", nianFen=list[0], chanLiang=list[1])
    if __name__ == '__main__':
        app.run()
    

    Results for requirements 1 and 2

    3. Company-benefits word cloud

            The job_well field holds strings such as [五险一金 餐饮补贴 通讯补贴 绩效奖金 定期体检],

           which need a word count, producing pairs such as (五险一金, 200); a plain-Python illustration follows, and the Spark version comes after it.

            Create a result table: create table job_well( job_well varchar(20) , cnt int)
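
    A plain-Python equivalent of that word count, for illustration only (the project does it in Spark below); it assumes the cleaned table and the same local MySQL credentials as earlier:

    from collections import Counter

    import pymysql

    connect = pymysql.connect(host='localhost', user='root', password='123456',
                              db='spark', port=3306, charset="utf8")
    with connect.cursor() as cursor:
        cursor.execute("SELECT job_well FROM scrapy_job_copy1")
        rows = cursor.fetchall()
    connect.close()

    counter = Counter()
    for (job_well,) in rows:
        # e.g. "五险一金 餐饮补贴 通讯补贴 绩效奖金 定期体检" -> one count per welfare keyword
        counter.update(word for word in (job_well or "").split() if word)

    print(counter.most_common(10))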

            Spark code

    package job
    
    import org.apache.spark.sql.{SaveMode, SparkSession}
    
    import java.util.Properties
    
    object Job_well {
      def main(args: Array[String]): Unit = {
    
        // todo welfare word cloud

        //session
        val spark = SparkSession.builder()
          .appName("51job")
          .master("local[*]")
          .getOrCreate()

        // mysql connection properties
        val properties = new Properties()
        properties.setProperty("user", "root")
        properties.setProperty("password", "123456")

        // read the cleaned data from mysql

        //implicit conversions
        import spark.implicits._

        //source
        val dataFrame = spark.read.jdbc("jdbc:mysql://localhost:3306/spark?serverTimezone=UTC", "scrapy_job_copy1", properties)
    
        dataFrame.distinct().createTempView("job")
    
        val dataSql = spark.sql(""" SELECT job_well from job """).rdd
    
        val value = dataSql.flatMap {
          row => {
            // e.g. "五险一金 餐饮补贴 通讯补贴 绩效奖金 定期体检" -> one element per welfare keyword
            val strings = row.toString().replaceAll("[\\[\\]]", "").split(" ")

            strings
          }
        }
        val result = value.map((_, 1)).reduceByKey(_ + _)
    
    
        //result.foreach(println)
        val frame = result.toDF("job_well", "cnt")
    
        //sink
        frame.write.mode(SaveMode.Overwrite).jdbc("jdbc:mysql://localhost:3306/spark?serverTimezone=UTC", "job_well", properties)
    
        //release resources
        spark.close()
      }
    
    }
    

    4. Distribution of big-data jobs by city

            The job_place field, e.g. 乌鲁木齐-天山区, has to be cut down to the city name (乌鲁木齐) before grouping and aggregating.

            Spark produces the aggregated data and writes it back to MySQL, and the visualization reads from there.

       job_place.scala

    
    import org.apache.spark.sql.{SaveMode, SparkSession}
    
    import java.util.Properties
    
    object Job_place {
      def main(args: Array[String]): Unit = {
    
        // todo split the place field and aggregate by city

        //session
        val spark = SparkSession.builder()
          .appName("51job")
          .master("local[*]")
          .getOrCreate()

        // mysql connection properties
        val properties = new Properties()
        properties.setProperty("user","root")
        properties.setProperty("password","123456")

        // read the crawled data from mysql
        
        //implicit conversions
        import spark.implicits._

        //source
        val dataFrame = spark.read.jdbc("jdbc:mysql://localhost:3306/spark?serverTimezone=UTC", "scrapy_job", properties)
    
        dataFrame.createTempView("job")
    
        val dataSql = spark.sql(""" SELECT job_place from job """).rdd
        
        val value = dataSql.flatMap{
          row => {
            // e.g. 成都-武侯区 -> 成都 (drop the district suffix)
            val strings = row.toString()
              .replaceAll("[\\[\\]]","")
              .split("\\-.+$")
            strings
          }
        }
        val result = value.map((_, 1)).reduceByKey(_ + _)
    
    
        //result.foreach(println)
        val frame = result.toDF("job_place", "cnt")
    
        //sink
        frame.write.mode(SaveMode.Overwrite).jdbc("jdbc:mysql://localhost:3306/spark?serverTimezone=UTC", "job_place", properties)
        
        //release resources
        spark.close()
      }
    
    }
    

     The Python visualization here uses pyecharts.

    visual.py

    import re
    
    from flask import Flask, render_template
    import pymysql
    from pyecharts import options as opts
    from pyecharts.charts import Geo
    from pyecharts.charts import WordCloud
    import jieba
    import collections
    
    app = Flask(__name__)
    
    
    # Connect to MySQL, run the given query, and return (place, count) pairs with the district suffix stripped
    def init_mysql(sql):
        con = pymysql.connect(host="localhost", user="root", password="123456", database="spark", charset="utf8", port=3306)
    
        cursor = con.cursor()
    
        cursor.execute(sql)
        result = cursor.fetchall()
        arr = []
        for i in result:
            # a = (re.findall("^..",i[0]))
            a = re.split("\-.+$", i[0])
            b = i[1]
            arr.append((a[0], b))
        return arr
    
    
    @app.route("/4")
    def req14():
        sql = """SELECT * FROM job_place """
        result = init_mysql(sql)
        print(result)
        c = (
            Geo()
                .add_schema(maptype="china")
                .add("geo", result)
                .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
                .set_global_opts(
                visualmap_opts=opts.VisualMapOpts(),
                title_opts=opts.TitleOpts(title="Geo-基本示例"),
            )
        )
        c.render("./job51/templates/大数据工作城市分布.html")
    
        return render_template("大数据工作城市分布.html")
    
    
    @app.route("/3")
    def req3():
        sql = """SELECT * FROM job_well """
        result = init_mysql(sql)
        print(result)
    
        wc = WordCloud()
        wc.add(series_name="福利词云图", data_pair=result)
        wc.render_notebook()
        wc.render("./job51/templates/福利词云图.html")
    
        return render_template("福利词云图.html")
    
    
    if __name__ == '__main__':
        app.run()
    

    At this point pyecharts reports that it cannot find coordinates for some job_place values such as '黔东南', so those rows are simply deleted, which is a bit of a hassle (an alternative that keeps the rows is sketched below).

    delete from job_place 
    where job_place in ('燕郊开发区', '黔东南', '怒江', '雄安新区', '普洱', '黔南', '延边');
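
    Alternatively (a suggestion, not part of the original post), pyecharts' Geo chart can be taught the missing place names with add_coordinate instead of deleting the rows; the coordinates below are rough illustrative values and should be verified:

    from pyecharts.charts import Geo

    c = Geo()
    c.add_schema(maptype="china")
    # add_coordinate(name, longitude, latitude) registers a custom place name for the Geo chart
    c.add_coordinate("黔东南", 107.98, 26.58)
    c.add_coordinate("燕郊开发区", 116.82, 39.95)
    # ...then continue with c.add("geo", result) exactly as in visual.py above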
    

    Results for requirements 3 and 4

    After the first run, comment out c.render("./job51/templates/大数据工作城市分布.html").

    I did not figure out how to pass parameters into the template, so the maximum of the visual map had to be typed into the HTML by hand (a sketch of deriving it from the data instead follows).
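
    A sketch of setting that maximum from the data itself rather than editing the HTML; the sample result list is made up for illustration, the rest mirrors the visual.py code above:

    from pyecharts import options as opts
    from pyecharts.charts import Geo

    result = [("成都", 120), ("北京", 300), ("上海", 280)]  # stand-in for the (city, count) pairs from job_place
    max_cnt = max(cnt for _, cnt in result)

    c = (
        Geo()
        .add_schema(maptype="china")
        .add("geo", result)
        .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
        .set_global_opts(
            visualmap_opts=opts.VisualMapOpts(max_=max_cnt),  # scale the visual map to the data
            title_opts=opts.TitleOpts(title="大数据工作城市分布"),
        )
    )
    c.render("大数据工作城市分布.html")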

    III. Summary

            I am a second-year junior college student and this is my first blog post, so I hope more experienced developers will point out whatever I got wrong. The post also draws on a lot of material from other blogs; the project plus the write-up took about three days of struggling through. Finally, I hope everyone lands an offer soon.

    Author: loding...
    Source: CSDN
    Copyright notice: this is an original post by the author; please include a link to the original when reposting.

  • Scraping 51job with scrapy

    2019-05-23 21:22:42

    spider.py

    # -*- coding: utf-8 -*-
    import scrapy
    # ".." is the parent package, "." the current one
    # import the JobSpidererItem data model from the items module in the parent package
    from ..items import JobSpidererItem
    
    class JobsSpider(scrapy.Spider):
        name = 'jobs'
        allowed_domains = ['search.51job.com']
        start_urls = ['http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=010000%2C020000%2C030200%2C040000%2C080200%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=99&keyword=python&keywordtype=2&curr_page=1&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&dibiaoid=0&confirmdate=9','http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=010000%2C020000%2C030200%2C040000%2C080200%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=99&keyword=java&keywordtype=2&curr_page=1&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&dibiaoid=0&confirmdate=9','http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=010000%2C020000%2C030200%2C040000%2C080200%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=99&keyword=php&keywordtype=2&curr_page=1&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&dibiaoid=0&confirmdate=9']
        # parse() is called automatically once a request succeeds and the response comes back
        def parse(self, response):
            # First find all the divs that wrap the job listings
            divs = response.xpath('//div[@id="resultList"]/div[@class="el"]')
            for div in divs:
                # job title
                zwmc = div.xpath('p/span/a/text()').extract_first('未知')
                # strip whitespace from the title
                zwmc = zwmc.strip()
                zwmc = zwmc.strip('\n\r')
                # company name
                gsmc = div.xpath('span[@class="t2"]//text()').extract_first('未知')
                # work location
                gzdd = div.xpath('span[@class="t3"]/text()').extract_first('未知')
                # monthly salary
                zwyx = div.xpath('span[@class="t4"]/text()').extract_first('面议')
                # publish date
                fbrq = div.xpath('span[@class="t5"]/text()').extract_first('未知')
    
                # Create a JobSpidererItem to hold the extracted fields
                item = JobSpidererItem()
                item['zwmc'] = zwmc
                item['gsmc'] = gsmc
                item['zwyx'] = zwyx
                item['gzdd'] = gzdd
                item['fbrq'] = fbrq
                # yield the item to the pipeline
                yield item
            # link to the next page
            next_href = response.xpath('//div[@class="p_in"]/ul/li[last()]/a/@href').extract_first('')
            # follow it if there is one
            if next_href:
                # yield the request
                yield scrapy.Request(next_href)
    

    items.py

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    # item data model
    class JobSpidererItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        zwmc = scrapy.Field() # job title
        gsmc = scrapy.Field() # company name
        zwyx = scrapy.Field() # monthly salary
        gzdd = scrapy.Field() # work location
        fbrq = scrapy.Field() # publish date
    

    pipelines.py

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import xlwt
    # Write the items to an Excel workbook
    class JobSpidererPipeline(object):
    
        def __init__(self):
            # Initialize the xls workbook and its header row
            self.workbook = xlwt.Workbook(encoding='utf-8')
            self.sheet = self.workbook.add_sheet(u'51job职位数据')
            self.sheet.write(0, 0, '职位名称')
            self.sheet.write(0, 1, '公司名称')
            self.sheet.write(0, 2, '工作地点')
            self.sheet.write(0, 3, '职位月薪')
            self.sheet.write(0, 4, '发布日期')
            # Row counter for the rows written so far
            self.count = 1
    
        # Called when the spider closes
        def close_spider(self, spider):
            # Save the workbook
            self.workbook.save(u'51job职位信息.xls')
    
        # Called when the object is destroyed
        def __del__(self):
            print('pipeline object destroyed.......')
    
        def process_item(self, item, spider):
            # Write one row per item
            self.sheet.write(self.count, 0, item['zwmc'])
            self.sheet.write(self.count, 1, item['gsmc'])
            self.sheet.write(self.count, 2, item['gzdd'])
            self.sheet.write(self.count, 3, item['zwyx'])
            self.sheet.write(self.count, 4, item['fbrq'])
            # Advance the row number
            self.count += 1
            print(self.count)
    
            return item
    

    settings.py

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for Job_Spiderer project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'Job_Spiderer'
    
    SPIDER_MODULES = ['Job_Spiderer.spiders']
    NEWSPIDER_MODULE = 'Job_Spiderer.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'Job_Spiderer (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = 1
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en',
       'User-Agent':"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0",
    }
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'Job_Spiderer.middlewares.JobSpidererSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'Job_Spiderer.middlewares.JobSpidererDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'Job_Spiderer.pipelines.JobSpidererPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    

    debug.py

    # coding: utf-8
    # import the execute helper
    from scrapy.cmdline import execute
    # run the crawl command
    execute(['scrapy', 'crawl', 'jobs'])
    

    middlewares.py

    # -*- coding: utf-8 -*-
    
    # Define here the models for your spider middleware
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    from scrapy import signals
    
    
    class JobSpidererSpiderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the spider middleware does not modify the
        # passed objects.
    
        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s
    
        def process_spider_input(self, response, spider):
            # Called for each response that goes through the spider
            # middleware and into the spider.
    
            # Should return None or raise an exception.
            return None
    
        def process_spider_output(self, response, result, spider):
            # Called with the results returned from the Spider, after
            # it has processed the response.
    
            # Must return an iterable of Request, dict or Item objects.
            for i in result:
                yield i
    
        def process_spider_exception(self, response, exception, spider):
            # Called when a spider or process_spider_input() method
            # (from other spider middleware) raises an exception.
    
            # Should return either None or an iterable of Response, dict
            # or Item objects.
            pass
    
        def process_start_requests(self, start_requests, spider):
            # Called with the start requests of the spider, and works
            # similarly to the process_spider_output() method, except
            # that it doesn’t have a response associated.
    
            # Must return only requests (not items).
            for r in start_requests:
                yield r
    
        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)
    
    
    class JobSpidererDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.
    
        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s
    
        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.
    
            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            return None
    
        def process_response(self, request, response, spider):
            # Called with the response returned from the downloader.
    
            # Must either;
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
            return response
    
        def process_exception(self, request, exception, spider):
            # Called when a download handler or a process_request()
            # (from other downloader middleware) raises an exception.
    
            # Must either:
            # - return None: continue processing this exception
            # - return a Response object: stops process_exception() chain
            # - return a Request object: stops process_exception() chain
            pass
    
        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)
    
  • Defining a Job51Item and scraping 51job search results into a JSON file with Scrapy

    1 Define what to scrape

    Add the following to items.py:

    class Job51Item(scrapy.Item):
        title = scrapy.Field()
        salary = scrapy.Field()
        job_city = scrapy.Field()
        requirement = scrapy.Field() # experience, education, headcount
        company_name = scrapy.Field()
        company_size = scrapy.Field()
        publish_date = scrapy.Field()
        job_advantage_tags = scrapy.Field() # benefits
        url = scrapy.Field()
    

    2 Write the spider

    Because of 51job's anti-crawling measures, the job listings are not rendered directly in the HTML the way a browser displays them; they are hidden inside a script tag.

    Create a new file job_51.py in the spiders folder with the following code:

    # -*- coding: utf-8 -*-
    import json
    import re
    import scrapy
    from scrapy.http import Request
    
    from job.items import Job51Item
    
    
    class Job51Spider(scrapy.Spider):
        name = 'job51'
        allowed_domains = ['jobs.51job.com', 'search.51job.com']
        start_urls = [
            'https://search.51job.com/list/090200,000000,0000,00,9,99,python,2,'
            '3.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0'
            '&dibiaoid=0&line=&welfare=']
    
        def parse(self, response):
            body = response.body.decode("gbk")
            with open("b.html", "w", encoding="utf-8") as f:
                f.write(body)
            data = re.findall('window.__SEARCH_RESULT__ =(.+)}</script>', str(body))[0] + "}"
            data = json.loads(data)
            for result in data["engine_search_result"]:
                item = Job51Item()  # one item per result
                item["requirement"] = result["attribute_text"]
                item["url"] = result["job_href"]
                item["title"] = result["job_name"]
                item["salary"] = result["providesalary_text"]
                item["job_city"] = result["workarea_text"]
                item["publish_date"] = result["issuedate"]
                item["job_advantage_tags"] = result["jobwelf"]
                item["company_name"] = result["company_name"]
                item["company_size"] = result["companysize_text"]
                yield item
            for i in range(2, 10):
                url = f"https://search.51job.com/list/090200,000000,0000,00,9,99,python,2,{i}.html?lang=c&postchannel" \
                      f"=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line" \
                      f"=&welfare= "
                yield Request(url=url, callback=self.parse, )
    

    3 Define the pipeline

    3.1 Enable the pipeline

    Add the following to settings.py:

    ITEM_PIPELINES = {
        'job.pipelines.JobPipeline': 300,
    }
    

    Note: alternatively, just uncomment the corresponding block that is already there.

    3.2 Write the pipeline code

    Add the following to pipelines.py:

    import json
    
    from itemadapter import ItemAdapter
    
    
    class JobPipeline:
        def __init__(self):
            self.fp = open('result.json', 'w', encoding='utf-8')
            self.first_item = True  # used to avoid a trailing comma in the JSON array
    
        def open_spider(self, spider):
            self.fp.write("[")
    
        def process_item(self, item, spider):
            data = json.dumps(dict(item), ensure_ascii=False)
            # write a separator before every item except the first, so the file stays valid JSON
            if not self.first_item:
                self.fp.write(",\n")
            self.first_item = False
            self.fp.write(data)
            return item
    
        def close_spider(self, spider):
            self.fp.write("]")
            self.fp.close()
            print("spider end")
    
  • Storing the scraped data via the items and pipelines files (note that the spider file also needs the corresponding code; see the 51job and Zhilian Zhaopin examples) -- set up the items file first
  • Scraping 51job and Zhilian Zhaopin job data with the Scrapy framework
  • The previous post briefly covered scraping 51job position listings with scrapy, including following pagination to the last page; this one outlines how to save the scraped content into a MySQL database
  • Scraping the 51job (前程无忧) site with scrapy

    2020-08-25 06:43:40
    Asking for help: the spider fetches the data but fails to save it into MongoDB (spider source included)
  • Having just learned to set up a scrapy project, trying it out on 51job to look at position requirements; remember to adjust the settings file first
  • Scraping python position listings from 51job (前程无忧) with Scrapy, originally as the data source for a data-analysis exercise
  • Spider code for scraping 51job (QianCheng project, start_requests-based)
  • Scraping 51job with the scrapy framework

    2017-03-04 14:05:46
    A CrawlSpider with LinkExtractor rules
  • A classroom exercise: using the scrapy framework to crawl 51job position listings and save them to a database (Python 3.6, PyCharm)
  • Scraping nationwide data-analyst positions from 51job with scrapy and doing a simple analysis (tools: scrapy, MongoDB, Excel, Tableau)
  • Three approaches compared: a basic scrapy spider for 51job, an API-based crawl of Zhilian Zhaopin, and a full-site crawl of Lagou, with the data stored in MySQL
  • Crawling 51job and Zhilian Zhaopin at the same time with scrapy

    2018-03-08 20:21:44
    Running two or more spiders at once: create a run file per spider, using the same item model
  • Scraping 51job data with Scrapy-Redis (tools: Python 3, scrapy-redis, Redis)
  • Scraping 51job (前程无忧) job listings for the keyword "python" with Scrapy (Python 3.x, PyCharm)
  • Scraping python positions with scrapy

    2019-01-09 11:57:00
    scrapy genspider Job51Spider www.51job.com, then edit /spiders/Job51Spider.py
  • Scraping 51job with scrapy by simulating a browser, for JavaScript-rendered pages
  • Scraping 51job (前程无忧) with Scrapy

    2021-01-25 04:27:10
    Multi-page crawling by changing the page number in https://search.51job.com/list/000000,000000,0000,32,9,99,+,2,xxxx.html; the page data is dynamic JSON received by a JS variable