Crawling Tianyancha, Qixinbao, and Other Company-Information Lookup Sites with Java
2022-10-24 15:45:48

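With some free time on my hands, I set out to write a program that quickly collects company information and exports it to an Excel sheet. So I started digging into Tianyancha. Without further ado, here goes.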

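1. Tianyancha lives at https://www.tianyancha.com. Search for a keyword there, for example 教育 (education), and the site reports 100000+ matching companies. Page through the results, though, and you will find that without logging in you can view only 2 pages; after that you are prompted to log in to see more. So log in (Tianyancha offers quick SMS login). Once logged in, start analyzing the page. I recommend Chrome: press F12 to open the developer tools, then press Ctrl+Shift+C and click the block of information you want to grab. It turns out everything sits inside the node outlined in red in the screenshot.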

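Now that we know which region holds the data, how do we pull the page down?

Java's standard library can handle this. The java.net.HttpURLConnection class can imitate a browser request. Straight to the code: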

package com.zsx.crawler.utils.TianYanChaCompanyCrawler;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class WebUtil {

    // Paste the Cookie header that your logged-in browser sends to Tianyancha here.
    private static final String COOKIE = "<your browser cookie>";

    public static String getPageContent(String url) {
        StringBuilder sb = new StringBuilder();
        HttpURLConnection httpUrlConn = null;
        try {
            // Open the connection
            URL u = new URL(url);
            httpUrlConn = (HttpURLConnection) u.openConnection();
            httpUrlConn.setDoInput(true);
            httpUrlConn.setRequestMethod("GET");

            // Request headers. The optional ones stay commented out; note that enabling
            // Accept-Encoding would require decompressing the response before reading it.
            //httpUrlConn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
            //httpUrlConn.setRequestProperty("Accept-Encoding", "gzip, deflate, br");
            //httpUrlConn.setRequestProperty("Accept-Language", "zh-CN,zh;q=0.9");
            //httpUrlConn.setRequestProperty("Connection", "keep-alive");
            //httpUrlConn.setRequestProperty("Host", "www.tianyancha.com");
            //httpUrlConn.setRequestProperty("Referer", "https://www.tianyancha.com/");
            //httpUrlConn.setRequestProperty("Upgrade-Insecure-Requests", "1");
            httpUrlConn.setRequestProperty("Cookie", COOKIE);
            httpUrlConn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36");

            // Wrap the byte stream in a buffered UTF-8 reader; try-with-resources closes it.
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(httpUrlConn.getInputStream(), StandardCharsets.UTF_8))) {
                // Read the response line by line. Lines are joined without newlines,
                // so the whole page ends up as one long string.
                String data;
                while ((data = br.readLine()) != null) {
                    sb.append(data);
                    System.out.println(data);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release the connection
            if (httpUrlConn != null) {
                httpUrlConn.disconnect();
            }
        }
        return sb.toString();
    }
}

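Call this method and print the return value, or save it to a .txt file, and you will see that the page source has been fetched successfully. Analyzing that source reveals hidden elements, and those hidden elements hold the page's company information as JSON. That makes things even simpler: pull the values you want straight out of the JSON and write them to an Excel sheet.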
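To make the hidden-element idea concrete, here is a minimal sketch that pulls the JSON text out of each hidden span using only the standard library. The class name HiddenFieldExtractor and the sample markup are made up for illustration; only the span's class attribute, tt hidden, is taken from the crawler code in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original project.
final class HiddenFieldExtractor {

    // DOTALL lets ".*?" cross line breaks in case the page text keeps its newlines.
    // group(1) captures only the text between the opening and closing tags.
    private static final Pattern HIDDEN_SPAN =
            Pattern.compile("<span class=\"tt hidden\">(.*?)</span>", Pattern.DOTALL);

    // Returns the raw text found inside every hidden span, in document order.
    static List<String> extractPayloads(String html) {
        List<String> payloads = new ArrayList<>();
        Matcher m = HIDDEN_SPAN.matcher(html);
        while (m.find()) {
            payloads.add(m.group(1).trim());
        }
        return payloads;
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for a fetched results page.
        String sample =
                "<div><span class=\"tt hidden\">{\"name\":\"Example Co A\"}</span></div>"
              + "<div><span class=\"tt hidden\">{\"name\":\"Example Co B\"}</span></div>";
        for (String payload : extractPayloads(sample)) {
            System.out.println(payload);
        }
    }
}
```

Each returned string can then be handed to whichever JSON parser you prefer.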

Here is the code:

package com.zsx.crawler.utils.TianYanChaCompanyCrawler;

import org.json.JSONObject;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlerCompanyUtil {

    public static void main(String[] args) {
        String web = WebUtil.getPageContent("https://www.tianyancha.com/search?key=%E6%95%99%E8%82%B2%E7%A7%91%E6%8A%80&base=bj");

        // Print the pagination footer
        Pattern pageCount = Pattern.compile("<div class=\"result-footer\">.*?</div>");
        Matcher matcher1 = pageCount.matcher(web);
        while (matcher1.find()) {
            String group = matcher1.group();
            System.out.println("Pages: " + group);
        }

        // Each hidden span holds one company's JSON; group(1) is the text between the tags
        Pattern companyInfo = Pattern.compile("<span class=\"tt hidden\">(.*?)</span>");
        Matcher matcher = companyInfo.matcher(web);
        while (matcher.find()) {
            String eachGroup = matcher.group(1);
            JSONObject json = new JSONObject(eachGroup);

            // Company name
            String companyName = json.get("name").toString();
            // Legal representative
            String legalperson = json.get("legalPersonName").toString();
            // Registered capital
            String registeredfund = json.get("regCapital").toString();
            // Registration date (the key is spelled this way in the page data)
            String registeredtime = json.get("estiblishTime").toString();
            // Phone list
            String phone = json.get("phoneList").toString();
            // Email list
            String email = json.get("emailList").toString();
            // Registered address
            String address = json.get("regLocation").toString();
            // Other details: business scope and the matched field
            String qita = "Business scope: " + json.get("businessScope") + "\n" + json.get("matchField");

            System.out.println(companyName);
            System.out.println(legalperson);
            System.out.println(registeredfund);
            System.out.println(registeredtime);
            System.out.println(phone);
            System.out.println(email);
            System.out.println(address);
            System.out.println(qita);
            System.out.println();
        }
    }
}
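The key parameter in the search URL above is simply the keyword 教育科技 percent-encoded as UTF-8. As a rough sketch, the URL can be built for any keyword with java.net.URLEncoder. The class and method names here are hypothetical, and treating base as a region code is an assumption drawn only from the bj value in that URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the original project.
final class SearchUrlBuilder {

    // Builds a search URL in the same shape as the one hard-coded in the crawler.
    // "base" is assumed to be a region code such as "bj"; its meaning is not documented here.
    static String buildSearchUrl(String keyword, String base) throws UnsupportedEncodingException {
        return "https://www.tianyancha.com/search?key="
                + URLEncoder.encode(keyword, "UTF-8")
                + "&base="
                + URLEncoder.encode(base, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(buildSearchUrl("教育科技", "bj"));
    }
}
```

Note that URLEncoder performs form encoding, so a space in the keyword becomes a plus sign rather than %20.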

Whether to store the data in a database or export it to an Excel sheet is up to you.
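For a quick export without extra libraries, one option is a CSV file, which Excel can open directly. The sketch below is only an illustration under that assumption: it writes CSV rather than a native .xlsx workbook, and the class name CompanyCsvExporter, the output file name, and the sample row are all made up.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original project.
final class CompanyCsvExporter {

    // Quotes a field when it contains a comma, quote, or line break; doubles embedded quotes.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        String escaped = field.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Joins one row of fields into a single CSV line.
    static String toCsvLine(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(escape(fields.get(i)));
        }
        return sb.toString();
    }

    // Writes all rows as UTF-8. The leading BOM helps Excel detect the encoding.
    static void write(Path file, List<List<String>> rows) throws IOException {
        List<String> lines = new ArrayList<>();
        for (List<String> row : rows) {
            lines.add(toCsvLine(row));
        }
        if (!lines.isEmpty()) {
            lines.set(0, "\uFEFF" + lines.get(0));
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Header names mirror the JSON keys read by the crawler; the data row is made up.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("name", "legalPersonName", "regCapital", "regLocation"),
                Arrays.asList("Example Co, Ltd.", "Zhang San", "1000000", "Beijing"));
        write(Paths.get("companies.csv"), rows);
    }
}
```

In the crawler loop, each company's values would be collected into one row instead of being printed.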

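However, Tianyancha lets ordinary users view only the first 5 pages of results, so I went on to study the Qixinbao site, which the next post covers. That post includes detailed code for exporting the data to an Excel sheet and crawls without a page limit. Link: Crawling Qixinbao with Java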

