  • Scraping data with Java

    2020-11-29 15:02:17

    HttpUtils

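    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.util.EntityUtils;
    
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    
    public class HttpUtils {
        // Connection pool manager (currently disabled)
        // private static PoolingHttpClientConnectionManager cm;
        private static List<String> userAgentlist = null;
        private static RequestConfig config = null;
    
        static {
            //cm = new PoolingHttpClientConnectionManager();
            //cm.setMaxTotal(200);
            //cm.setDefaultMaxPerRoute(20);
            // Socket, connect, and pool-lease timeouts, all 10 seconds
            config = RequestConfig.custom()
                    .setSocketTimeout(10000)
                    .setConnectTimeout(10000)
                    .setConnectionRequestTimeout(10000)
                    .build();
            userAgentlist = new ArrayList<>();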
            userAgentlist.add("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");
            userAgentlist.add("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763");
            userAgentlist.add("Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko");
            userAgentlist.add("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:65.0) Gecko/20100101 Firefox/65.0");
            userAgentlist.add("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36");
        }
    
        public static String getHtml(String url){
            HttpGet httpGet = new HttpGet(url);
            httpGet.setConfig(config);
            // Pick a random User-Agent for each request
            httpGet.setHeader("User-Agent",userAgentlist.get(new Random().nextInt(userAgentlist.size())));
            // try-with-resources closes the response and the client even when execute() throws
            try (CloseableHttpClient httpClient = SSLClient.createSSLClient();
                 CloseableHttpResponse response = httpClient.execute(httpGet)) {
                if (response.getStatusLine().getStatusCode() == 200){
                    String html ="";
                    if (response.getEntity() !=null) {
                        html = EntityUtils.toString(response.getEntity(), "UTF-8");
                    }
                    return html;
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
    
            // Non-200 status or I/O failure
            return null;
        }
    }
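
    The User-Agent rotation in getHtml can also be pulled out into a small helper. The following is a minimal JDK-only sketch, not part of the original HttpUtils: the class name UserAgentPicker and its shortened agent list are assumptions made for illustration. It uses ThreadLocalRandom so that no new Random object is created per request.

    ```java
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical helper (not in the original post): picks a User-Agent at random.
    final class UserAgentPicker {
        // Shortened sample list; in practice reuse the strings from HttpUtils.
        private static final List<String> USER_AGENTS = Collections.unmodifiableList(Arrays.asList(
                "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:65.0) Gecko/20100101 Firefox/65.0"));

        private UserAgentPicker() {
        }

        // Picks from the built-in list.
        static String pick() {
            return pick(USER_AGENTS);
        }

        // Picks a random element from the given non-empty list.
        static String pick(List<String> agents) {
            if (agents == null || agents.isEmpty()) {
                throw new IllegalArgumentException("agents must not be empty");
            }
            return agents.get(ThreadLocalRandom.current().nextInt(agents.size()));
        }
    }
    ```

    With this helper the header line would read httpGet.setHeader("User-Agent", UserAgentPicker.pick()).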
    

    Skipping SSL certificate validation

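    import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.ssl.SSLContextBuilder;
    import org.apache.http.ssl.TrustStrategy;
    
    import javax.net.ssl.HostnameVerifier;
    import javax.net.ssl.SSLContext;
    import javax.net.ssl.SSLSession;
    import java.security.cert.CertificateException;
    import java.security.cert.X509Certificate;
    
    public class SSLClient {
        public static CloseableHttpClient createSSLClient(){
            SSLContext sslContext = null;
            try {
                sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustStrategy() {
                    @Override
                    public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                        // Trust every certificate
                        return true;
                    }
                }).build();
                SSLConnectionSocketFactory sslSocketFactory = new SSLConnectionSocketFactory(sslContext, new HostnameVerifier() {
                    @Override
                    public boolean verify(String s, SSLSession sslSession) {
                        // Do not verify the hostname
                        return true;
                    }
                });
                return HttpClients.custom().setSSLSocketFactory(sslSocketFactory).build();
            }catch (Exception e){
                e.printStackTrace();
            }
            // If an exception occurred, fall back to a default client
            return  HttpClients.createDefault();
        }
    }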
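
    For comparison, the same "trust every certificate" idea can be sketched with the JDK alone. This is a minimal sketch, not the original author's code: the class name TrustAllSsl is an assumption, and the approach is the plain javax.net.ssl counterpart of the TrustStrategy above. It disables certificate validation entirely, so it is only suitable for testing against hosts you control.

    ```java
    import java.security.GeneralSecurityException;
    import java.security.SecureRandom;
    import java.security.cert.X509Certificate;
    import javax.net.ssl.SSLContext;
    import javax.net.ssl.TrustManager;
    import javax.net.ssl.X509TrustManager;

    // Hypothetical JDK-only helper. INSECURE: accepts any certificate chain.
    final class TrustAllSsl {
        private TrustAllSsl() {
        }

        // Builds a TLS context whose trust manager never rejects a certificate.
        static SSLContext trustAllContext() throws GeneralSecurityException {
            TrustManager trustAll = new X509TrustManager() {
                @Override
                public void checkClientTrusted(X509Certificate[] chain, String authType) {
                    // Intentionally empty: no client certificate is rejected.
                }

                @Override
                public void checkServerTrusted(X509Certificate[] chain, String authType) {
                    // Intentionally empty: no server certificate is rejected.
                }

                @Override
                public X509Certificate[] getAcceptedIssuers() {
                    return new X509Certificate[0];
                }
            };
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, new TrustManager[] {trustAll}, new SecureRandom());
            return context;
        }
    }
    ```

    The resulting context could be handed to HttpsURLConnection via setSSLSocketFactory(context.getSocketFactory()). Hostname verification is a separate check and is not touched by this sketch.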
    

    Dependencies used

            <dependency>
                <groupId>com.alibaba</groupId>
                <artifactId>fastjson</artifactId>
                <version>1.2.22</version>
            </dependency>
            <dependency>
                <groupId>org.apache.httpcomponents</groupId>
                <artifactId>httpclient</artifactId>
                <version>4.5.3</version>
            </dependency>
            <dependency>
                <groupId>org.jsoup</groupId>
                <artifactId>jsoup</artifactId>
                <version>1.10.3</version>
            </dependency>
            <dependency>
                <groupId>org.apache.commons</groupId>
                <artifactId>commons-lang3</artifactId>
                <version>3.7</version>
            </dependency>
            <dependency>
                <groupId>commons-io</groupId>
                <artifactId>commons-io</artifactId>
                <version>2.7</version>
            </dependency>
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
                <version>1.7.25</version>
            </dependency>
    
  • Scraping API data with Java

    First, the result of the scrape: 46,884 records. For ** reasons the content itself is not published.

    [Screenshot: the scraped records]

    [Screenshot: files being downloaded]

    [Screenshot: the downloaded files]

    Without further ado, here is the code.

    The files can only be downloaded once the data is available, so the data-scraping code comes first.

    Dependencies added to the pom file
    Only the key packages are listed.

       <!--commons-->
            <dependency>
                <groupId>org.apache.commons</groupId>
                <artifactId>commons-lang3</artifactId>
                <version>3.5</version>
            </dependency>
            <dependency>
                <groupId>commons-io</groupId>
                <artifactId>commons-io</artifactId>
                <version>2.5</version>
            </dependency>
            <!-- MybatisPlus -->
            <dependency>
                <groupId>com.baomidou</groupId>
                <artifactId>mybatis-plus-boot-starter</artifactId>
                <version>3.1.1</version>
            </dependency>
            <!-- Gson -->
            <dependency>
                <groupId>com.google.code.gson</groupId>
                <artifactId>gson</artifactId>
                <version>2.8.5</version>
            </dependency>
            <!-- okhttp -->
            <dependency>
                <groupId>com.squareup.okhttp3</groupId>
                <artifactId>okhttp</artifactId>
                <version>3.14.2</version>
            </dependency>
            <dependency>
                <groupId>cn.hutool</groupId>
                <artifactId>hutool-all</artifactId>
                <version>4.5.16</version>
            </dependency>
    

    Key code

        /**
         * Fetches data from the API so it can be stored in the local database.
         *
         * @param param1 parameter 1
         * @param param2 parameter 2
         * @param param3 parameter 3
         */
        private void getDataToLocalDataBase(String param1, String param2, String param3) {
            HttpParam httpParam = new HttpParam();
            httpParam.setApiUrl("URL of the site to scrape");
            httpParam.setApiPath("API path");
            Map<String, String> parms = new HashMap<>();
    
            // Keys must match the API's parameter names exactly (no trailing spaces)
            parms.put("param1", param1);
            parms.put("param2", param2);
            parms.put("param3", param3);
            // ...more parameters...
            /*     parms.put("strCustName","");*/
            // Serialize the parameters to JSON
            Gson paramGson = new GsonBuilder().create();
            String requestParam = paramGson.toJson(parms);
            try {
                // POST request
                HttpResult<?> postResult = HttpUtil.post(httpParam, requestParam);
                String result = postResult.getResult();
    
                int status = postResult.getStatus();
                Gson gson = new Gson();
                if (status == 200) {
                    if (!StringUtils.isEmpty(result)) {
                        JsonObject jsonObject = (JsonObject) new JsonParser().parse(result);
                        JsonElement jsonElement = jsonObject.get("result");
                        String newResult = jsonElement.toString();
                        // xxData is an entity class matching the API's response; List<xxData> can be another type as needed
                        List<xxData> list = gson.fromJson(newResult, new TypeToken<List<xxData>>() {
                        }.getType());
    
                        // Check for null before reading the size
                        if (list != null && list.size() > 0) {
                            log.info("Record count: {}", list.size());
                            // Business code: insert the data into the local database
                        }
                    } else {
                        log.info("No data");
                    }
                } else {
                    log.error("Error response: {}", result);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
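
    The "insert the data into the local database" step is left open in the code above. With tens of thousands of records, one common option is to write them in fixed-size batches rather than one by one or all at once. The following is a minimal sketch under that assumption: the class name Batches and the batch size are hypothetical and not taken from the original post.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper (not in the original post): splits a list into fixed-size batches.
    final class Batches {
        private Batches() {
        }

        // Returns consecutive sub-lists of at most batchSize elements, in the original order.
        static <T> List<List<T>> partition(List<T> items, int batchSize) {
            if (batchSize <= 0) {
                throw new IllegalArgumentException("batchSize must be positive");
            }
            List<List<T>> batches = new ArrayList<>();
            for (int start = 0; start < items.size(); start += batchSize) {
                int end = Math.min(start + batchSize, items.size());
                // Copy the view so each batch is independent of the source list
                batches.add(new ArrayList<>(items.subList(start, end)));
            }
            return batches;
        }
    }
    ```

    Each batch returned by Batches.partition(list, 1000) could then be passed to whatever batch-insert method the persistence layer provides.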
    

    Utility classes used above

    • HttpParam
    import okhttp3.MediaType;
    
    
    public class HttpParam {
        // Default media type: JSON encoded as UTF-8
        public static final MediaType MEDIA_TYPE_JSON = MediaType.parse("application/json; charset=utf-8");
    
        /**
         * Base URL of the API
         */
        private String apiUrl;
    
        /**
         * API path
         */
        private String apiPath;
    
        /**
         * Read timeout in milliseconds
         */
        private int readTimeout = 30 * 1000;
    
        /**
         * Write timeout in milliseconds
         */
        private int writeTimeout = 30 * 1000;
    
        /**
         * Connect timeout in milliseconds
         */
        private int connectTimeout = 2 * 1000;
    
        /**
         * Media type of the request body
         */
        private MediaType mediaType = MEDIA_TYPE_JSON;
    
        public String getApiUrl() {
            return apiUrl;
        }
    
        public void setApiUrl(String apiUrl) {
            this.apiUrl = apiUrl;
        }
    
        public String getApiPath() {
            return apiPath;
        }
    
        public void setApiPath(String apiPath) {
            this.apiPath = apiPath;
        }
    
        public int getReadTimeout() {
            return readTimeout;
        }
    
        public void setReadTimeout(int readTimeout) {
            this.readTimeout = readTimeout;
        }
    
        public int getWriteTimeout() {
            return writeTimeout;
        }
    
        public void setWriteTimeout(int writeTimeout) {
            this.writeTimeout = writeTimeout;
        }
    
        public int getConnectTimeout() {
            return connectTimeout;
        }
    
        public void setConnectTimeout(int connectTimeout) {
            this.connectTimeout = connectTimeout;
        }
    
        public MediaType getMediaType() {
            return mediaType;
        }
    
        public void setMediaType(MediaType mediaType) {
            this.mediaType = mediaType;
        }
    }
    
    • HttpResult (customize this class to suit the response you receive)
    public class HttpResult<T> {
    
        private int status;
        private String result;
        private T resultObject;
    
        public HttpResult() {
        }
    
        public HttpResult(int status, String result, T resultObject) {
            this.status = status;
            this.result = result;
            this.resultObject = resultObject;
        }
    
        public int getStatus() {
            return status;
        }
    
        public void setStatus(int status) {
            this.status = status;
        }
    
        public String getResult() {
            return result;
        }
    
        public void setResult(String result) {
            this.result = result;
        }
    
        public T getResultObject() {
            return resultObject;
        }
    
        public void setResultObject(T resultObject) {
            this.resultObject = resultObject;
        }
    }
    

    Now for the key part

    • HttpUtil
    import com.google.gson.Gson;
    import com.google.gson.GsonBuilder;
    import lombok.extern.slf4j.Slf4j;
    import okhttp3.*;
    
    import java.io.IOException;
    import java.util.Map;
    import java.util.concurrent.TimeUnit;
    
    
    @Slf4j
    public class HttpUtil {
    
        private static Gson gson = new GsonBuilder().serializeNulls().disableHtmlEscaping().create();
    
        /**
         * GET request; returns the response body as a string
         */
        public static String get(HttpParam restParam) throws Exception {
            String url = restParam.getApiUrl();
    
            if (restParam.getApiPath() != null) {
                url = url+restParam.getApiPath();
            }
            Request request = new Request.Builder()
                    .url(url)
                    .get()
                    .build();
            return exec(restParam, request).getResult();
        }
    
        /**
         * GET request; deserializes the response into tClass
         */
        public static <T> HttpResult<T> get(HttpParam restParam, Class<T> tClass) throws Exception {
            String url = restParam.getApiUrl();
    
            if (restParam.getApiPath() != null) {
                url = url+restParam.getApiPath();
            }
            Request request = new Request.Builder()
                    .url(url)
                    .get()
                    .build();
            return exec(restParam, request, tClass);
        }
    
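        /**
         * POST request with an empty body
         */
        public static <T> HttpResult<T> post(HttpParam restParam, Class<T> tClass) throws Exception {
            String url = restParam.getApiUrl();
            if (restParam.getApiPath() != null) {
                url = url + restParam.getApiPath();
            }
            // Without .post(...) the builder defaults to GET, so send an empty body explicitly
            RequestBody body = RequestBody.create(restParam.getMediaType(), "");
            Request request = new Request.Builder().url(url).post(body).build();
            return exec(restParam, request, tClass);
        }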
    
        /**
         * POST a JSON string; deserializes the response into tClass
         */
        public static <T> HttpResult<T> post(HttpParam restParam, String reqJsonData, Class<T> tClass) throws Exception {
            String url = restParam.getApiUrl();
            if (restParam.getApiPath() != null) {
                url = url+restParam.getApiPath();
            }
            RequestBody body = RequestBody.create(restParam.getMediaType(), reqJsonData);
            Request request = new Request.Builder()
                    .url(url).post(body).build();
            return exec(restParam, request, tClass);
        }
    
        /**
         * POST form data built from a map
         */
        public static <T> HttpResult<T> post(HttpParam restParam, Map<String, String> parms, Class<T> tClass) throws Exception {
            String url = restParam.getApiUrl();
            if (restParam.getApiPath() != null) {
                url = url+restParam.getApiPath();
            }
            FormBody.Builder builder = new FormBody.Builder();
            if (parms != null) {
                for (Map.Entry<String, String> entry : parms.entrySet()) {
                    builder.add(entry.getKey(), entry.getValue());
                }
            }
            FormBody body = builder.build();
            Request request = new Request.Builder()
                    .url(url)
                    .post(body)
                    .build();
            return exec(restParam, request, tClass);
        }
    
        /**
         * POST a JSON string and return the raw result
         */
        public static <T> HttpResult<T> post(HttpParam restParam,  String reqJsonData) throws Exception {
            String url = restParam.getApiUrl();
            if (restParam.getApiPath() != null) {
                url = url+restParam.getApiPath();
            }
            RequestBody body = RequestBody.create(restParam.getMediaType(), reqJsonData);
            Request request = new Request.Builder()
                    .url(url).post(body).build();
            return exec(restParam, request);
        }
    
        /**
         * Wraps the response in an object of the given class
         */
        private static <T> HttpResult<T> exec(
                HttpParam restParam,
                Request request,
                Class<T> tClass) throws Exception {
    
            HttpResult<?> clientResult = exec(restParam, request);
            String result = clientResult.getResult();
            int status = clientResult.getStatus();
    
            T t = null;
            if (status == 200) {
                // Deserialize only when the body is non-empty
                if (result != null && !result.isEmpty()) {
                    t = gson.fromJson(result, tClass);
                }
            } else {
                try {
                    result = gson.fromJson(result, String.class);
                } catch (Exception ex) {
                    ex.printStackTrace();
                }
            }
            return new HttpResult<>(clientResult.getStatus(), result, t);
        }
    
        /**
         * Executes the request
         */
        private static HttpResult exec(
                HttpParam restParam,
                Request request) throws Exception {
    
            HttpResult result = null;
    
            okhttp3.OkHttpClient client = null;
            ResponseBody responseBody = null;
            try {
                // Timeouts must be set on the builder that creates the client;
                // calling newBuilder() on an existing client leaves that client unchanged
                client = new okhttp3.OkHttpClient.Builder()
                        .connectTimeout(restParam.getConnectTimeout(), TimeUnit.MILLISECONDS)
                        .readTimeout(restParam.getReadTimeout(), TimeUnit.MILLISECONDS)
                        .writeTimeout(restParam.getWriteTimeout(), TimeUnit.MILLISECONDS)
                        .build();
    
                Response response = client.newCall(request).execute();
                // Capture the body first so the finally block closes it on failure too
                responseBody = response.body();
                if (response.isSuccessful()) {
                    if (responseBody != null) {
                        String responseString = responseBody.string();
    
                        result = new HttpResult<>(response.code(), responseString, null);
                    }
                } else {
                    throw new Exception(response.message());
                }
            } catch (Exception ex) {
                // Keep the original exception as the cause
                throw new Exception(ex.getMessage(), ex);
            } finally {
                if (responseBody != null) {
                    responseBody.close();
                }
                if (client != null) {
                    client.dispatcher().executorService().shutdown();   // shut down the thread pool
                    client.connectionPool().evictAll();                 // evict and close pooled connections
                    try {
                        if (client.cache() != null) {
                            client.cache().close();                         // close the cache
                        }
                    } catch (IOException e) {
                        throw new Exception(e.getMessage(), e);
                    }
                }
            }
            return result;
        }
    }
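
    Every method in HttpUtil builds the request URL by plain concatenation of getApiUrl() and getApiPath(), which yields a doubled or missing slash when the two parts do not line up. A small helper can normalize the join. This is a minimal sketch; the class name UrlJoiner is an assumption and is not part of the original HttpUtil.

    ```java
    // Hypothetical helper (not in the original post): joins a base URL and a path with exactly one slash.
    final class UrlJoiner {
        private UrlJoiner() {
        }

        static String join(String apiUrl, String apiPath) {
            // Mirror HttpUtil: a missing path means the base URL is used as-is
            if (apiPath == null || apiPath.isEmpty()) {
                return apiUrl;
            }
            boolean baseEndsWithSlash = apiUrl.endsWith("/");
            boolean pathStartsWithSlash = apiPath.startsWith("/");
            if (baseEndsWithSlash && pathStartsWithSlash) {
                // Drop the duplicate slash
                return apiUrl + apiPath.substring(1);
            }
            if (!baseEndsWithSlash && !pathStartsWithSlash) {
                // Add the missing slash
                return apiUrl + "/" + apiPath;
            }
            return apiUrl + apiPath;
        }
    }
    ```

    In HttpUtil the repeated "url = url + restParam.getApiPath()" blocks could then be replaced by a single call such as UrlJoiner.join(restParam.getApiUrl(), restParam.getApiPath()).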
    
    

    Downloading files with Java

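    That wraps up scraping the data. Next, the data is used to download the files.
    Most of the file-download articles I found online were unreliable; in the end, with some help, three lines of code did the job.
    Before that, the key utility packages still need to be imported, namely the pom dependencies listed at the top (FileUtils comes from commons-io).
    
     // copyURLToFile expects the destination file itself, not a directory
     URL url = new URL("file URL");
     File destination = new File("local path to save the file");
     FileUtils.copyURLToFile(url, destination);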
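
    If commons-io is not available, roughly the same download can be written with the JDK alone. This is a minimal sketch and not the original author's code: the class name UrlDownloader is an assumption, and unlike a production downloader it sets no connect or read timeout.

    ```java
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    // Hypothetical JDK-only helper (not in the original post).
    final class UrlDownloader {
        private UrlDownloader() {
        }

        // Copies the content at source into destination and returns the number of bytes written.
        static long download(URL source, Path destination) throws IOException {
            // Create missing parent directories first
            Path parent = destination.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            // The stream is closed automatically; an existing file is overwritten
            try (InputStream in = source.openStream()) {
                return Files.copy(in, destination, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
    ```

    As with copyURLToFile, the second argument is the target file path, not a folder.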
    

  • Java data scraping template

    2021-05-24 16:54:11
    
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    import java.io.IOException;
    import java.net.URL;
    
    public class paqu {
        public static void main(String[] args) throws IOException {
            // URL of the page to scrape (left empty in the original; fill it in before running)
            String url = "";
            Document document = Jsoup.parse(new URL(url), 30000);
            // id of the container element
            Element element = document.getElementById("J_goodsList");
            if (element == null) {
                System.out.println("Element J_goodsList not found");
                return;
            }
            System.out.println(element.html());
            // The tag to scrape; here it is li
            Elements elements = element.getElementsByTag("li");
    
            for (Element el : elements) {
                // Content to scrape; adjust to the page's markup.
                // p-price and p-name appear to be CSS class names rather than tags, so select them by class
                String img = el.getElementsByTag("img").eq(0).attr("src");
                String price = el.getElementsByClass("p-price").eq(0).text();
                String name = el.getElementsByClass("p-name").eq(0).text();
    
                System.out.println(img);
                System.out.println(price);
                System.out.println(name);
            }
        }
    }
    

    That is the code.

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.2</version>
    </dependency>
    

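    That is the required jar.

    Note that this only prints the scraped values to the console; actually downloading the images and other content requires additional code.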
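
    A first step toward downloading the images is turning each scraped src value into a usable URL and a local file name. The following is a minimal sketch under stated assumptions: the class name ImageUrls is hypothetical, and it assumes that some src values are protocol-relative (starting with "//"), which is common on shopping sites but not confirmed by the original post.

    ```java
    import java.net.URI;

    // Hypothetical helper (not in the original post): prepares scraped image URLs for download.
    final class ImageUrls {
        private ImageUrls() {
        }

        // Turns a protocol-relative "//host/path" into "https://host/path"; other values are only trimmed.
        static String toAbsolute(String src) {
            String trimmed = src.trim();
            return trimmed.startsWith("//") ? "https:" + trimmed : trimmed;
        }

        // Returns the last path segment of an absolute URL, e.g. "pic.jpg".
        static String fileName(String absoluteUrl) {
            String path = URI.create(absoluteUrl).getPath();
            return path.substring(path.lastIndexOf('/') + 1);
        }
    }
    ```

    The absolute URL and the file name can then be passed to a download routine such as FileUtils.copyURLToFile from commons-io.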

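  • Scraping data with Java (Part 3)

    1,000+ reads 2018-03-21 17:31:43

    I needed some data for my job at a certain company, so I scraped it from the company's internal web site. I won't show the actual code because it touches on company confidentiality.
    That said, jsoup is a framework similar to jQuery, and anyone with a little front-end background can manage it. I hope you will build small demos too and practice by scraping a few sites. That's all for now; I'm off to clean data.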

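  • Scraping data probably needs no introduction. Jsoup: jsoup is a Java HTML parser that can directly parse a URL or HTML text. Dependency (the most-downloaded version in the Maven repository): <dependency> <groupId>org.jsoup</groupId> ...
  • I have recently been using the open-source Java package htmlparser to scrape data from web pages, but many pages embed JS functions that only fire on a mouse click and then request data from the server (the page's static URL does not change, e.g. viewing comments or collapsing replies), so I would like to ask, in such...
  • Scraping JD.com data with Java

    2018-08-10 09:24:35
    Scraping JD.com data with Java: uses Java's DOM classes, fetches the front-end page's DOM with a request, then extracts the corresponding tags by a specific pattern.
  • Scraping JSON data with Java

    2020-07-19 14:50:39
    Scraping JSON data with Java: scrape the projects listed on 码市 and generate a document for easy viewing. Inspecting the 码市 project page shows that the projects are not loaded directly; the XHR requests show they are fetched in a second request as JSON, and the request URL also shows that pagination is driven by the trailing number...
  • Introduces how to scrape Douban movie data with Java, using examples to analyze the underlying principles, steps, implementation techniques, and caveats; readers who need it can use it as a reference
  • Scraping data inside Flash with Java
  • Code for scraping web page data; parsing code; parsing walkthrough; complete code. Introduction: 1. Scraping is implemented with org.jsoup and HttpClients 2. For multi-page content, loop and scrape page by page 3. Scraped data is parsed into jsonoup 4. Retrieved data is saved directly to a file...
  • Scraping web page data with Java

    1,000+ reads 2019-03-20 19:21:39
    Page to scrape:... Data to scrape: The required data is in this part of the source: First define the data class: public class Information { String type; String volume; String money; String market_value; String numb...
  • Scraping Baidu data with Java

    2019-01-22 10:18:28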
    import java.io.BufferedReader; import java.io.File; import java.io.FileInputStream; import java.io.FileNotFoundException; import java.io.FileOutputStream; import java.io.FileReader; ...
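  • Scraping data with a Java crawler

    2018-05-04 16:48:51
    Uses HTML tools, multithreading, a message queue, and a simulated browser to scrape web page data
  • Scraping web page data with Java

    2018-03-30 23:06:00
    I recently implemented a simple web data scraper in Java; here are the principle and the code. Principle: use the URL object in java.net to open a link, download the target page's source, and parse the data in the source with jsoup to get the content you want. 1. First, based on the URL...
  • Scraping JD.com product data with Java

    1,000+ reads 2018-09-21 22:44:26
    Scraping JD.com product data ... Import the dependencies needed for scraping; write the httpClient utility class; write the pojo class; write the dao <dependencies> <dependency...
  • Simple web page scraping with Java

    2019-10-16 19:43:46
    Static scraping public class HttpUtil { public static Document get(String url, String charset) throws IOException { String userAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) " + ...
  • Scraping data from HTTPS sites with Java

    1,000+ reads 2019-10-14 13:59:53
    Scraping HTTP sites with Java is fairly easy because no certificate channel needs to be established: you access the link through httpclient and read the returned source to get the data. Here we use a certificate to scrape data from HTTPS sites with Java. 1. Download the site's certificate ...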
  • import java.sql.Connection; import java.sql.DriverManager; import java.sql.SQLException; public class JdbcUtil { private static String url = "jdbc:mysql://localhost:3306/exam?serverTimezone=UTC"; ...
