2017-10-31

Big Data: 50 Fascinating and Free Data Sources for Data Visualization

Have you ever felt frustrated when searching Google for data? Pages of seemingly relevant websites, but none that meets your needs? Have you ever felt that your articles are less persuasive without data to support them?

Bookmark this article: it is a comprehensive guide to data sources, including general data, government data, and market data for the U.S. and China.

General Data

1. World Health Organization

World Health Organization offers data and analysis on global health priorities, like world hunger, health, and disease.

2. The Broad Institute

The Broad Institute offers a number of data sets in biology and medicine.

3. Amazon Web Services

Amazon Web Services hosts a cloud-based, cross-discipline data platform covering chemistry, biology, economics, and more. Its data sets include an attempt to build the most comprehensive database of human genetic information and NASA’s database of satellite imagery of Earth.

4. Figshare

Figshare is a platform for sharing research results. Here you can see some amazing findings from amazing people around the globe.

5. UCLA

Sometimes UCLA shares some of their findings in research papers.

6. UCI Machine Learning Repository

This website currently maintains 394 data sets as a service to the machine learning community.

7. GitHub

Some cool people built the GitHub community, sharing a bunch of awesome data sets. The data there is now offered by everyone, to everyone.

8. Pew Research Center

Pew Research Center offers its raw data from its fascinating research into American life.

Government Data

9. Data.gov

Data.gov is the home of the U.S. government’s open data. You could find data, tools and resources here to conduct research, data visualization, etc.

10. US Census Bureau

The US Census Bureau offers a wealth of information on the lives of US citizens, covering population data, geographic data, and education.

11. Open Data Network

Open Data Network is an easy-to-search website for finding government-related data, with nice built-in visualization tools.

12. European Union Open Data Portal

The European Union Open Data Portal gives access to a growing range of data from European Union institutions.

13. Canada Open Data

Canada Open Data enables you to get quick, easy access to the government of Canada's most requested services and information.

14. Open Government Data

This website provides visitors with great open government data from US, EU, Canada, CKAN, and more.

15. The CIA World Factbook

The World Factbook provides information on the history, people, government, economy, geography, communications, transportation, military, and transnational issues for 267 world entities.

16. Gov.uk

Gov.uk hosts data from the UK Government, including the British National Bibliography, which holds metadata on all UK books and publications since 1950.

17. HealthData.gov

HealthData.gov is dedicated to making high-value health data more accessible to entrepreneurs, researchers, and policy makers in the hope of better health outcomes for all. It has 125 years of US healthcare data, including claim-level Medicare data, epidemiology, and population statistics.

18. UNICEF

UNICEF offers statistics and reports on the situation of children worldwide.

19. National Climatic Data Center

The US National Climatic Data Center holds a huge collection of environmental, meteorological, and climate data sets. It is the world’s largest archive of weather data.

Google

20. Google Public Data

Google Public Data includes data from world development indicators, OECD, and human development indicators, mostly related to economics and the world.

21. Google Trends

Google Trends provides statistics on search volume (as a proportion of total search) for any given term, since 2004.

22. Google Finance

Google Finance offers 40 years’ worth of stock market data, updated in real time.

Market Data

E-commerce Platforms

23. Amazon

24. eBay

Amazon and eBay are two of the biggest e-commerce platforms in the U.S., listing huge numbers of products for customers. They also offer product information that marketers and researchers can analyze.

25. Yelp

Yelp lists many restaurants, and customers post reviews that help other diners choose where to eat. The restaurant information and customer reviews are also extremely valuable for marketers to study.

26. Yellowpages

Yellowpages was a big brand even before the Internet era. The website offers business information.

27. Cars

This website provides information on both used and new cars, including the owner’s contact information.

Real Estate

28. Real Estate

29. Zillow

30. Realtor

These three websites list houses and apartments that are for sale or for rent and offer very comprehensive housing information.

31. Trip Advisor

A platform where customers review hotels around the world. Visitors can use the reviews to find the best hotels for a vacation, and the reviews are well worth studying if you are in the hotel industry.

32. Glassdoor

A recruiting website listing thousands of vacant positions. The information extracted can be used to study labor costs.

33. LinkedIn

The leading social network for professional communication. Thousands of users are registered, and user profiles are quite credible. It is very useful for people trying to find a job or get sales leads.

Chinese Market

China is a market with huge potential, so I have also listed some websites for getting Chinese market data.

Real Estate

34. www.58.com （58同城）

35. www.anjuke.com （安居客）

36. www.qfang.com （Q房网）

37. Fang.com （房天下）

These websites gather comprehensive data on real estate in China. The boom in China's real estate market has made housing prices a hot topic, and these sites offer massive and reliable data for people researching real estate in China.

E-commerce Platforms

38. JD.com

39. Tmall.com

40. Taobao.com

China’s e-commerce companies supply a massive range of products to the world, and many gadgets imported from China have swept the globe. So what do China’s e-commerce platforms look like? Aren’t you curious?

Car Markets

41. www.autohome.com.cn（汽车之家）

China has always been a market with huge potential that every car manufacturer wants to seize. The best website for market researchers to collect data from is Autohome, which gathers a huge amount of data and consumer reviews and is well suited to analyzing the Chinese car market.

Car Rental Market

42. www.zuche.com（神州租车）

43. www.1hai.cn（一嗨租车）

These two websites lead China's car rental market. Collecting car usage information from them can help you conduct relevant analyses.

Transportation, Hotel, Travel

44. www.ctrip.com （携程网）

45. www.qunar.com（去哪儿网）

By extracting data from these websites, you can learn how the transportation, hotel, and travel markets are doing in China.

Catering Market

46. www.dianping.com（大众点评）

47. www.meituan.com（美团网）

The above websites are similar to Yelp. As Chinese consumers become wealthier and pickier, the quality of the comments on these sites is relatively high.

Recruiting Websites

48. www.lagou.com（拉勾网）

49. www.zhaopin.com（智联招聘）

50. www.51job.com（前程无忧）

These websites gain thousands of undergraduate users every year because they offer great jobs. By extracting data from these sites, we can learn about market demand in particular industries.

Nowadays we live in a world of information integration, and the data sources shown above are just the tip of the iceberg. Now that we are entering the big data era, it is no longer the case that using data moves us forward; rather, if we do not use data, we fall behind. As an ancient Chinese proverb says, "He who does not advance loses ground."

Related sources:

Top 30 Big Data Tools for Data Analysis

Top 30 Free Web Scraping Software

2017-10-17

Octoparse: A Scraping Tool for Non-Programmers


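In the previous post I introduced a tool called Octoparse and covered how to register, download, install, and extract data with it. (For details of the previous post, see here.) This time, to help you understand Octoparse better, I will introduce its main features, walk through a concrete example, and cover several extended features.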

https://nelog.jp/octoparse#Octoparse

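1. Overview

2. Main Features of Octoparse

3. A Worked Example

4. Extended Features of Octoparse

5. Summary

1. Overview

Octoparse is a simple and highly visual web scraper that lets even people with little programming knowledge collect and extract data from the web.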

Brand	Octoparse
Customer support	Facebook community, phone, email, Skype
Price	From $75 (free version available)
Trial period	5 days (Pro version)
Operating system	Windows XP, 7, 8, 10
Data export formats	CSV, Excel, TXT, HTML, databases (SQL Server, MySQL, Oracle)
Multithreading	Yes (unlimited)
API (application programming interface)	Yes
Scheduling	Yes
Cloud service	Yes

2. Main Features of Octoparse

(1) Easy web scraping by clicking and dragging

Octoparse is a tool that makes web scraping available to every user. Its interface is laid out in panes that are very easy to grasp visually. Basically, by clicking, pointing, and dragging, you can build a workflow capable of scraping 98% of existing websites.

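(2) Support for dynamic websites

For more complex scraping, for example when data is loaded with JavaScript on an interactive website, Octoparse offers a solution in all of the following cases.

Scraping behind a login
Search-based extraction
Scraping data loaded with Ajax
Infinite scroll
Pagination without a "Next" button
Nested drop-down menus
Filling in forms
Capturing data hidden in the HTML

And so on.

Octoparse is designed so that every user can crawl data. With the built-in XPath and RegEx tools, developers and non-developers alike can easily match each individual element on a web page precisely. (See the extended features section for details.)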

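(3) Support

If you are using the free version, look for help in the Octoparse group on Facebook. The members of that community are generally keen to help and explain. You can also contact Octoparse support, but a response may take some time.

If you are using a paid version, the Octoparse team gives you priority and provides support by phone, email, and Skype.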

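3. A Worked Example

The section above briefly introduced the main features of Octoparse. For readers who want to know more, this section sets up a scenario and walks through a concrete example.

Imagine you are a young employee who has just moved to Tokyo. The first thing to sort out is finding a rental apartment, right? There is so much rental information online that it is hard to decide which apartment to choose. A tidy list of rental apartments would make comparison much easier. Octoparse is a great tool for exactly that situation.

suumo.jp is the largest general information site for real estate and rental housing, offering a wealth of information to investors, new employees, and anyone who needs housing. Suppose you are looking for an apartment within 15 minutes of Shibuya, Shinjuku, or Harajuku Station with rent of 150,000 yen or less. Let's scrape it with Octoparse.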

Step 1. Set up Basic Information.

Click "Quick Start". ➜ Click "New Task (Advanced Mode)". ➜ Fill in Basic Information.

Step 2. Go to the target website in the built-in browser.

Enter the URL you want to scrape in the built-in web browser. ➜ Click "Go" to open the site.

Example URL:

http://suumo.jp/jj/chintai/ichiran/FR301FC005/?ar=030&bs=040&pc=100&smk=&po1=00&po2=99&shkr1=03&shkr2=03&shkr3=03&shkr4=03&rn=0005&ek=000517640&ek=000531250&ek=000519670&ra=013&cb=0.0&ct=15.0&et=15&mb=0&mt=9999999&cn=9999999&fw2=
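The search conditions travel in the query string of the example URL in Step 2. As a rough sketch, Python's standard library can split that URL so the individual parameters are easier to read. The meaning of each parameter is not documented in this post; reading `ek` as station codes and `ct`/`et` as the rent and walking-time limits is only a guess based on the scenario.

```python
from urllib.parse import urlsplit, parse_qs

# The example search URL from Step 2, split across lines for readability.
url = (
    "http://suumo.jp/jj/chintai/ichiran/FR301FC005/"
    "?ar=030&bs=040&pc=100&smk=&po1=00&po2=99"
    "&shkr1=03&shkr2=03&shkr3=03&shkr4=03&rn=0005"
    "&ek=000517640&ek=000531250&ek=000519670"
    "&ra=013&cb=0.0&ct=15.0&et=15&mb=0&mt=9999999&cn=9999999&fw2="
)

parts = urlsplit(url)
# keep_blank_values=True keeps empty parameters such as "smk" and "fw2".
params = parse_qs(parts.query, keep_blank_values=True)

print(parts.netloc)                 # host name of the site
print(params["ek"])                 # one entry per repeated "ek" parameter
print(params["ct"], params["et"])   # guessed: rent cap and walking minutes
```

Because `ek` is repeated, `parse_qs` returns it as a list with three entries, which lines up with the three stations in the scenario.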

Step 3. Set up pagination.

Click "次へ" (the "Next" pagination link). ➜ Choose "Loop click the element".

Step 4. Create the list of items.

Drag "Loop item" into the Workflow. ➜ Choose "Variable list". ➜ Paste the XPath below into the box next to "Variable list" underneath. ➜ Click "Save".

XPath: //div[@class='property_group']/div (For details on XPath, see here.)
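To see what the XPath in Step 4 selects, here is a minimal sketch using Python's built-in `xml.etree.ElementTree`, which supports a subset of XPath. The markup is a made-up, simplified stand-in for a results page, not the real suumo.jp HTML, and Octoparse itself evaluates the XPath inside its own browser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for a results page.
page = """
<html><body>
  <div class="property_group">
    <div><h2>Apartment A</h2><span>12.5</span></div>
    <div><h2>Apartment B</h2><span>14.0</span></div>
  </div>
  <div class="other"><div><h2>Not a listing</h2></div></div>
</body></html>
""".strip()

root = ET.fromstring(page)
# ElementTree needs a leading "." to search relative to the root element.
items = root.findall(".//div[@class='property_group']/div")
titles = [item.findtext("h2") for item in items]
print(titles)
```

Each matched `div` is one listing, which is why the expression works as the loop's item list: every field extracted later is read relative to one of these elements.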

Step 5. Extract the search results.

To extract the title: ➜ Click the title. ➜ Choose "Extract text". Other content can be extracted the same way.

Step 6. Rename the extracted data fields.

Every data field is named automatically when it is extracted. ➜ To change a name, click "Field Name" and edit it.

Step 7. Correct the pagination XPath.

The default XPath that Octoparse sets cannot locate the "次へ" (Next) item correctly, so the XPath needs to be corrected. The corrected XPath is:

//P[@class='pagination-parts']/A[contains(text(),'次へ')] (For details on XPath, see here.)
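The corrected XPath in Step 7 picks the pagination link by its visible text. The sketch below reproduces that idea in Python on made-up markup. `ElementTree` has no `contains()` function, so the text test is done in Python, and the tag names are lowercase here because XML parsing is case-sensitive.

```python
import xml.etree.ElementTree as ET

# Hypothetical pager markup; the real page layout may differ.
pager = """
<div>
  <p class="pagination-parts"><a href="?page=1">前へ</a></p>
  <p class="pagination-parts"><a href="?page=3">次へ</a></p>
</div>
""".strip()

root = ET.fromstring(pager)

def find_next_link(root):
    """Return the first pagination link whose text contains '次へ', or None."""
    for link in root.iterfind(".//p[@class='pagination-parts']/a"):
        if "次へ" in (link.text or ""):
            return link
    return None

next_link = find_next_link(root)
print(next_link.get("href"))
```

Matching on the link text rather than its position keeps the selection stable even when the number of page links changes from page to page.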

Step 8. Run the extractor.

Click "Next". ➜ Click "Next" again. ➜ Click "Local Extraction". ➜ Click "OK" to run the task on your computer. Octoparse automatically extracts all the data you specified.

Once all the steps above are complete, you get neatly organized data.

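4. Extended Features of Octoparse

(1) The XPath and RegEx (regular expression) tools

Next-level web scraping

XPath and regular expressions are essential for complex web scraping, but they are not easy for beginners. So the Octoparse team provides tools designed so that anyone can easily build the XPath and RegEx needed for accurate scraping.

a. XPath tool

The XPath tool screen in Octoparse is made up of four panes.

Pane 1: Browser pane. Enter the URL you want to look at in the built-in browser and click "Go" to display the content of the web page.

Pane 2: Source code pane. Shows the source code of the web page.

Pane 3: XPath settings pane. Check the options, enter a few parameters, and click "Generate" to build the XPath expression.

Pane 4: XPath results pane. After the XPath is built, click "Match" to check whether the current XPath finds the elements on the web page.

For details on the Octoparse XPath tool, see here.

b. RegEx (regular expression) tool

A regular expression is a pattern used to match combinations of characters in a string. In any scraping scenario, even where a CSS selector or XPath does not work well, regular expression syntax lets you find the information you need quickly. Like the XPath tool, Octoparse has a built-in RegEx tool. With it, users do not have to struggle with matching characters and strings: you simply enter a few conditions and the RegEx is generated automatically.

For details on the Octoparse RegEx tool, see here.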
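As an illustration of what such a pattern does, here is a small sketch with Python's `re` module. The input strings and the pattern are invented for this example and are not output from the Octoparse RegEx tool.

```python
import re

# Hypothetical scraped strings; real listings may be laid out differently.
rows = ["Rent 12.5万円, 8 min walk", "Rent 9万円, 15 min walk"]

# One group for the rent (integer or decimal) and one for the walking time.
pattern = re.compile(r"Rent (\d+(?:\.\d+)?)万円, (\d+) min walk")

parsed = []
for row in rows:
    match = pattern.search(row)
    if match:
        parsed.append((float(match.group(1)), int(match.group(2))))

print(parsed)
```

The capture groups pull the two numbers out of free text, which is the kind of job a regular expression handles when a CSS selector or XPath cannot isolate the value.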

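c. Data re-formatting tool

Now you have extracted the data you wanted, but it is not in a usable form. For example, dates are in the wrong format, there are unwanted spaces between words, or there are unwanted prefixes or suffixes. Octoparse has a built-in re-formatting tool that makes the necessary conversions easy. The following eight conversions are supported.

① Replace: replaces strings or keywords in the extracted data.

② Replace with regular expression: replaces content that matches a given regular expression.

③ Match with regular expression: picks the keyword you want out of messy text.

④ Trim spaces: removes whitespace before and after the extracted data.

⑤ Add prefix: adds whatever you need (numbers, letters, symbols, and so on) to the start of the extracted data.

⑥ Add suffix: adds something to the end of the data; the exact opposite of "Add prefix".

⑦ Re-format extracted date/time: converts a date or time to the format you want.

⑧ Html transcoding: when extracting HTML source, decodes HTML-encoded characters into unencoded text.

For details on re-formatting data captured by Octoparse, see here.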
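For readers who think in code, the eight conversions above correspond roughly to the following standard-library operations in Python. This is only an analogy with made-up sample values; it says nothing about how Octoparse implements them.

```python
import html
import re
from datetime import datetime

raw = "  Rent: 12.5万円  "  # hypothetical extracted value

trimmed = raw.strip()                                    # Trim spaces
replaced = trimmed.replace("Rent: ", "")                 # Replace
regex_replaced = re.sub(r"万円$", "", replaced)           # Replace with regular expression
matched = re.search(r"\d+(?:\.\d+)?", trimmed).group()   # Match with regular expression
prefixed = "rent=" + regex_replaced                      # Add prefix
suffixed = regex_replaced + " x 10,000 yen"              # Add suffix

# Re-format extracted date/time: parse one layout, write another.
reformatted = datetime.strptime("2017/10/17", "%Y/%m/%d").strftime("%Y-%m-%d")

# Html transcoding: turn HTML entities back into plain characters.
decoded = html.unescape("Tom &amp; Jerry &lt;3")

print(trimmed, replaced, regex_replaced, matched, sep=" | ")
print(prefixed, suffixed, reformatted, decoded, sep=" | ")
```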

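(2) Cloud service

To further strengthen users' scraping capability, Octoparse offers a cloud service (for paid users). The cloud service provides the following four options.

① Scheduled automatic data scraping

Users can schedule the crawler so that scraping runs at any time, even in real time.

② Connection via API for real-time extraction

By connecting to the RESTful API, you can retrieve extracted data at whatever frequency you want, including in real time.

③ IP rotation to prevent IP blocking

Have you ever been hugely frustrated when, while scraping a website frequently, your IP address got blocked and you could no longer access the site? Of course you have. It happens especially often when extracting data from high-profile websites such as social platforms and business directories. With Octoparse, however, you can scrape these sites by rotating through many anonymous HTTP proxy servers, which minimizes the chance of being blocked.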
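The idea behind IP rotation can be sketched in a few lines: each request goes out through the next proxy in a pool, so no single address carries all the traffic. The proxy addresses and URLs below are placeholders, no request is actually sent, and how Octoparse chooses its proxies is not described in this post.

```python
from itertools import cycle

# Hypothetical proxy addresses (taken from a documentation-only IP range).
PROXIES = ["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"]

def assign_proxies(urls, proxies):
    """Pair each URL with the next proxy, wrapping around when the pool ends."""
    pool = cycle(proxies)
    return [(url, next(pool)) for url in urls]

urls = [f"https://example.com/page/{n}" for n in range(1, 6)]
plan = assign_proxies(urls, PROXIES)
for url, proxy in plan:
    print(url, "via", proxy)
```

With five pages and three proxies, the fourth page reuses the first proxy, so the load is spread evenly across the pool.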

④ Automatic export of data to a database

The Octoparse cloud service also supports automatic export to SQL Server, MySQL, and Oracle databases. Read the instructions here and follow the steps to connect your database to Octoparse.
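Once scraped rows are in a database, they can be filtered with ordinary SQL. The sketch below uses Python's built-in SQLite as a stand-in, since the databases named above need a running server. The table name, columns, and rows are all invented for illustration.

```python
import sqlite3

# Hypothetical scraped rows: (title, rent in units of 10,000 yen, walking minutes).
rows = [("Apartment A", 12.5, 8), ("Apartment B", 14.0, 12)]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE listings (title TEXT, rent_man_yen REAL, walk_min INTEGER)"
)
conn.executemany("INSERT INTO listings VALUES (?, ?, ?)", rows)
conn.commit()

# Query the exported data: listings at or below a rent cap.
cheap = conn.execute(
    "SELECT title FROM listings WHERE rent_man_yen <= ? ORDER BY rent_man_yen",
    (13.0,),
).fetchall()
conn.close()

print(cheap)
```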

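5. Summary

Octoparse is a feature-rich, visually intuitive web scraping tool. It is easy to recommend, especially because non-technical users can scrape the web with little effort. The software is capable and versatile, so it can scrape most dynamic sites fairly easily. It also comes with a free plan that supports scraping an unlimited number of web pages, which makes it clearly wallet-friendly at this price. All in all, Octoparse is well worth a try.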

2017-08-28

Top 30 Big Data Tools for Data Analysis


There are thousands of big data tools out there for data analysis today. Data analysis is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. To save you time, in this post I list 30 top big data tools for data analysis in the areas of open source data tools, data visualization tools, sentiment tools, data extraction tools, and databases.

Open Source Data Tools

1. KNIME

KNIME Analytics Platform is the leading open solution for data-driven innovation, helping you discover the potential hidden in your data, mine for fresh insights, or predict new futures.

With more than 1000 modules, hundreds of ready-to-run examples, a comprehensive range of integrated tools, and the widest choice of advanced algorithms available, KNIME Analytics Platform is the perfect toolbox for any data scientist.

2. OpenRefine

OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data. OpenRefine can help you explore large data sets with ease.

3. R-Programming

R, a GNU project, is written primarily in C and Fortran, and many of its modules are written in R itself. It is a free programming language and software environment for statistical computing and graphics. The R language is widely used among data miners for developing statistical software and data analysis. Ease of use and extensibility have raised R’s popularity substantially in recent years.

Besides data mining it provides statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others.

4. Orange

Orange is an open source data visualization and data analysis tool for novices and experts, providing interactive workflows with a large toolbox to analyse and visualize data. Orange is packed with different visualizations, from scatter plots, bar charts, and trees to dendrograms, networks, and heat maps.

5. RapidMiner

Much like KNIME, RapidMiner operates through visual programming and is capable of manipulating, analyzing and modeling data. RapidMiner makes data science teams more productive through an open source platform for data prep, machine learning, and model deployment. Its unified data science platform accelerates the building of complete analytical workflows – from data prep to machine learning to model validation to deployment – in a single environment, dramatically improving efficiency and shortening the time to value for data science projects.

6. Pentaho

Pentaho addresses the barriers that block your organization's ability to get value from all your data. The platform simplifies preparing and blending any data and includes a spectrum of tools to easily analyze, visualize, explore, report and predict. Open, embeddable and extensible, Pentaho is architected to ensure that each member of your team — from developers to business users — can easily translate data into value.

7. Talend

Talend is the leading open source integration software provider to data-driven enterprises. Its customers connect anywhere, at any speed. From ground to cloud and batch to streaming, data or application integration, Talend connects at big data scale, 5x faster and at 1/5th the cost.

8. Weka

Weka is open source software: a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a data set or called from your own Java code. It is also well suited to developing new machine learning schemes, since it is fully implemented in the Java programming language and supports several standard data mining tasks.

For someone who hasn’t coded for a while, Weka with its GUI provides the easiest transition into the world of data science. Because it is written in Java, those with Java experience can also call the library from their own code.

9. NodeXL

NodeXL is data visualization and analysis software for relationships and networks, and it provides exact calculations. The basic version (not the Pro one) is free and open source. It is one of the best statistical tools for data analysis, with advanced network metrics, access to social media network data importers, and automation.

10. Gephi

Gephi is also an open-source network analysis and visualization software package, written in Java on the NetBeans platform. Think of the giant friendship maps you see that represent LinkedIn or Facebook connections. Gephi takes that a step further by providing exact calculations.
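To make "exact calculations" on a friendship map concrete, here is a small sketch of one common network metric, degree centrality, computed in plain Python. The names and edges are invented, and this is not Gephi's or NodeXL's API, only an illustration of the kind of number such tools report.

```python
from collections import defaultdict

# Hypothetical undirected friendship edges.
edges = [("Ann", "Bob"), ("Ann", "Cho"), ("Bob", "Cho"), ("Cho", "Dee")]

neighbors = defaultdict(set)
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

n = len(neighbors)
# Degree centrality: a node's neighbor count divided by the maximum possible (n - 1).
degree_centrality = {node: len(adj) / (n - 1) for node, adj in neighbors.items()}
most_connected = max(degree_centrality, key=degree_centrality.get)

print(degree_centrality)
print(most_connected)
```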

Data Visualization Tools

11. Datawrapper

Datawrapper is an online data-visualization tool for making interactive charts. Once you upload data from a CSV, PDF, or Excel file or paste it directly into the field, Datawrapper will generate a bar chart, line chart, map, or other related visualization. Datawrapper graphs can be embedded into any website or CMS with ready-to-use embed codes. Many reporters and news organizations use Datawrapper to embed live charts into their articles. It is very easy to use and produces effective graphics.

12. Solver

Solver specializes in providing world-class financial reporting, budgeting and analysis with push-button access to all data sources that drive company-wide profitability. Solver provides BI360, which is available for cloud and on-premise deployment, focusing on four key analytics areas.

13. Qlik

Qlik lets you create visualizations, dashboards, and apps that answer your company’s most important questions. Now you can see the whole story that lives within your data.

14. Tableau Public

Tableau democratizes visualization in an elegantly simple and intuitive tool. It is exceptionally powerful in business because it communicates insights through data visualization. In the analytics process, Tableau's visuals allow you to quickly investigate a hypothesis, sanity check your gut, and just go explore the data before embarking on a treacherous statistical journey.

15. Google Fusion Tables

Meet Google Spreadsheets' cooler, larger, and much nerdier cousin. Google Fusion Tables is an incredible tool for data analysis, large data-set visualization, and mapping. Not surprisingly, Google's incredible mapping software plays a big role in pushing this tool onto the list. Take, for instance, this map, which I made to look at oil production platforms in the Gulf of Mexico.

16. Infogram

Infogram offers over 35 interactive charts and more than 500 maps to help you visualize your data beautifully. Create a variety of charts including column, bar, pie, or word cloud. You can even add a map to your infographic or report to really impress your audience.

Sentiment Tools

17. OpenText

The OpenText Sentiment Analysis module is a specialized classification engine used to identify and evaluate subjective patterns and expressions of sentiment within textual content. The analysis is performed at the topic, sentence, and document level and is configured to recognize whether portions of text are factual or subjective and, in the latter case, whether the opinion expressed within these pieces of content is positive, negative, mixed, or neutral.

18. Semantria

Semantria is a tool that offers a unique service approach by gathering texts, tweets, and other comments from clients and analyzing them meticulously to derive actionable and highly valuable insights. Semantria offers text analysis via API and Excel plugin. It differs from Lexalytics in that it incorporates a bigger knowledge base and uses deep learning.

19. Trackur

Trackur’s automated sentiment analysis looks at the specific keyword you are monitoring and then determines whether the sentiment towards that keyword within the document is positive, negative, or neutral. That is what is weighted most heavily in Trackur's algorithm. It can be used to monitor all social media and mainstream news and to gain executive insights through trends, keyword discovery, automated sentiment analysis, and influence scoring.

20. SAS Sentiment Analysis

SAS sentiment analysis automatically extracts sentiments in real time or over a period of time with a unique combination of statistical modeling and rule-based natural language processing techniques. Built-in reports show patterns and detailed reactions, so you can home in on the sentiments that are expressed.

With ongoing evaluations, you can refine models and adjust classifications to reflect emerging topics and new terms relevant to your customers, organization or industry.

21. Opinion Crawl

Opinion Crawl is an online sentiment analysis service for current events, companies, products, and people. Opinion Crawl allows visitors to assess Web sentiment on a topic: a person, an event, a company, or a product. You can enter a topic and get an ad-hoc sentiment assessment of it. For each topic you get a pie chart showing current real-time sentiment, a list of the latest news headlines, a few thumbnail images, and a tag cloud of key semantic concepts that the public associates with the subject. The concepts allow you to see what issues or events drive the sentiment in a positive or negative way. For more in-depth assessment, the web crawlers find the latest published content on many popular subjects and current public issues and calculate sentiment for them on an ongoing basis. The blog posts then show the trend of sentiment over time, as well as the Positive-to-Negative ratio.
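The labels used by the tools in this section (positive, negative, mixed, neutral) and the Positive-to-Negative ratio can be illustrated with a toy word-list classifier. This is a deliberately naive sketch with an invented word list and invented posts; the commercial tools above use far more sophisticated statistical and linguistic models.

```python
# Tiny, hypothetical sentiment lexicons.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def classify(text):
    """Label a text by which lexicon words it contains."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    if pos and neg:
        return "mixed"
    if pos:
        return "positive"
    if neg:
        return "negative"
    return "neutral"

posts = [
    "I love this phone, great battery!",
    "Terrible support.",
    "Good screen but poor speakers.",
    "It arrived on Monday.",
]
labels = [classify(post) for post in posts]

positive = labels.count("positive")
negative = labels.count("negative")
# Positive-to-Negative ratio; undefined when there are no negative posts.
ratio = positive / negative if negative else None

print(labels)
print(ratio)
```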

Data Extraction Tools

22. Octoparse

Octoparse is a free and powerful website crawler used for extracting almost any kind of data you need from a website. You can use Octoparse to rip a website with its extensive functionalities and capabilities. Its point-and-click UI helps non-programmers get used to Octoparse quickly. It lets you grab all the text from websites that use AJAX and JavaScript, so you can download almost all of a website's content and save it in a structured format such as Excel, TXT, HTML, or your own database.

For more advanced use, it provides Scheduled Cloud Extraction, which lets you refresh the website and get its latest information.

23. Content Grabber

Content Grabber is web crawling software targeted at enterprises. It can extract content from almost any website and save it as structured data in a format of your choice, including Excel reports, XML, CSV, and most databases.

24. Import.io

Import.io is a paid web-based data extraction tool. Pulling information off websites used to be something reserved for the nerds. Simply highlight what you want and Import.io walks you through and "learns" what you are looking for. From there, Import.io will dig, scrape, and pull data for you to analyze or export.

25. Parsehub

Parsehub is a great web crawler that supports collecting data from websites that use AJAX technologies, JavaScript, cookies, and so on. Its machine learning technology can read, analyze, and then transform web documents into relevant data. With the free version, you can set up no more than five public projects in Parsehub. The paid subscription plans allow you to create at least 20 private projects for scraping websites.

26. Mozenda

Mozenda is a cloud-based web scraping service. It provides many useful utility features for data extraction. Users can upload extracted data to cloud storage.

Databases

27. Data.gov

The US Government pledged last year to make all government data available freely online. This site is the first stage and acts as a portal to all sorts of amazing information on everything from climate to crime.

28. US Census Bureau

The US Census Bureau offers a wealth of information on the lives of US citizens, covering population data, geographic data, and education.

29. The CIA World Factbook

The World Factbook provides information on the history, people, government, economy, geography, communications, transportation, military, and transnational issues for 267 world entities.

30. PubMed

PubMed, developed by the National Library of Medicine (NLM), provides free access to MEDLINE, a database of more than 11 million bibliographic citations and abstracts from nearly 4,500 journals in the fields of medicine, nursing, dentistry, veterinary medicine, pharmacy, allied health, health care systems, and pre-clinical sciences. PubMed also contains links to the full-text versions of articles at participating publishers' Web sites. In addition, PubMed provides access and links to the integrated molecular biology databases maintained by the National Center for Biotechnology Information (NCBI). These databases contain DNA and protein sequences, 3-D protein structure data, population study data sets, and assemblies of complete genomes in an integrated system. Additional NLM bibliographic databases, such as AIDSLINE, are being added to PubMed. PubMed includes "Old Medline." Old Medline covers 1950-1965. (Updated daily)

More related resources:

Top 30 Free Web Scraping Software

9 FREE Web Scrapers That You Cannot Miss

Top 9 data visualization tools for non-developers

2017-05-17

Top 20 Web Crawler Tools to Collect Web Data


(from Top 20 Web Crawler Tools to Scrape the Websites | Octoparse, Free Web Scraping)

Web crawling (also known as web scraping) is widely applied in many areas today. It aims to fetch new or updated data from websites and store the data for easy access. Web crawler tools are becoming well known to the general public, since they simplify and automate the entire crawling process and make web data easily accessible to everyone. Using a web crawler tool frees people from repetitive typing or copy-pasting and yields a well-structured, all-inclusive data collection. Additionally, these web crawler tools let users crawl the World Wide Web methodically and quickly without coding and transform the data into various formats that suit their needs.

In this post, I propose the top 20 popular web crawlers around the web for your reference. You may find the web crawler best suited to your needs.

1. Octoparse

Octoparse is a free and powerful website crawler used for extracting almost any kind of data you need from a website. You can use Octoparse to rip a website with its extensive functionalities and capabilities. There are two modes, Wizard Mode and Advanced Mode, to help non-programmers get used to Octoparse quickly. After you download the freeware, its point-and-click UI lets you grab all the text from a website, so you can download almost all of the website's content and save it in a structured format such as Excel, TXT, HTML, or your own database.

For more advanced use, it provides Scheduled Cloud Extraction, which lets you refresh the website and get its latest information.

You can also extract data from tough websites with difficult data block layouts using its built-in Regex tool, and locate web elements precisely using the XPath configuration tool. You will no longer be bothered by IP blocking, since Octoparse offers IP proxy servers that automatically rotate IPs so that aggressive websites do not detect them.

To conclude, Octoparse should be able to satisfy most of users’ crawling needs, both basic and high-end, without any coding skills.

2. Cyotek WebCopy

WebCopy is a free website crawler that allows you to copy partial or full websites locally onto your hard disk for offline reading.

It scans the specified website before downloading the website content onto your hard disk and automatically remaps the links to resources such as images and other web pages in the site to match their local paths. It can also exclude a section of the website. Additional options are also available, such as downloading a URL to include in the copy without crawling it.

There are many settings for configuring how your website will be crawled. In addition to rules and forms, you can also configure domain aliases, user agent strings, default documents, and more.

However, WebCopy does not include a virtual DOM or any form of JavaScript parsing. If a website makes heavy use of JavaScript to operate, it is unlikely WebCopy will be able to make a true copy if it is unable to discover all of the website due to JavaScript being used to dynamically generate links.
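The limitation described above applies to any crawler that only reads the HTML as delivered. The sketch below uses Python's built-in `html.parser` on a made-up page to show the effect: links present in the markup are found, while a link that a script would create at run time is not. It illustrates the general behavior of static parsing and is not WebCopy's code.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags found in the static markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Hypothetical page: two static links plus one that only a script creates.
page = """
<html><body>
  <a href="/about.html">About</a>
  <a href="/contact.html">Contact</a>
  <script>
    var link = document.createElement("a");
    link.href = "/generated-by-script.html";
    document.body.appendChild(link);
  </script>
</body></html>
"""

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

The script is never executed, so `/generated-by-script.html` never becomes a link the collector can see. A tool would need a JavaScript engine or a real browser to discover it.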

3. HTTrack

As website crawler freeware, HTTrack provides functions well suited to downloading an entire website from the Internet to your PC. Versions are available for Windows, Linux, Sun Solaris, and other Unix systems. It can mirror one site, or more than one site together (with shared links). You can decide the number of connections to open concurrently while downloading web pages under “Set options”. You can get the photos, files, and HTML code from entire directories, update a currently mirrored website, and resume interrupted downloads.

Proxy support is also available with HTTrack to maximize speed, with optional authentication.

HTTrack works as a command-line program or through a shell, for both private (capture) and professional (online web mirror) use. That said, HTTrack is likely to be preferred and used more by people with advanced programming skills.

4. Getleft

Getleft is a free and easy-to-use website grabber that can be used to rip a website. It downloads an entire website with its easy-to-use interface and multiple options. After you launch Getleft, you can enter a URL and choose the files that should be downloaded before you begin downloading the website. As it goes, it changes the original pages: all the links are changed to relative links for local browsing. Additionally, it offers multilingual support; at present Getleft supports 14 languages. However, it provides only limited FTP support: it will download the files, but not recursively.
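Rewriting links as relative links is what makes a downloaded copy browsable from disk. A minimal sketch of that rewriting step in Python follows. The function name and URLs are hypothetical, it handles only same-site links, and it is not Getleft's actual implementation.

```python
import posixpath
from urllib.parse import urlsplit

def to_relative(link_url, page_url):
    """Rewrite an absolute same-site link as a path relative to the page holding it."""
    link_path = urlsplit(link_url).path
    page_dir = posixpath.dirname(urlsplit(page_url).path)
    return posixpath.relpath(link_path, start=page_dir)

page = "http://example.com/docs/guide/index.html"
image_link = to_relative("http://example.com/docs/img/logo.png", page)
sibling_link = to_relative("http://example.com/docs/guide/page2.html", page)

print(image_link)
print(sibling_link)
```

A relative path such as `../img/logo.png` resolves the same way whether the page is served from the original site or opened from a local folder, as long as the directory layout is preserved.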

On the whole, Getleft should satisfy users’ basic crawling needs without more complex tactical skills.
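Converting links to relative links for local browsing can be sketched with the Python standard library. This is an assumption about how such a rewrite could work, not Getleft's actual code, and the function name to_relative_link is hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def to_relative_link(page_url, href):
    """Rewrite href (as found on page_url) into a path relative to that page.

    Links that point to another host are returned as absolute URLs.
    """
    absolute = urljoin(page_url, href)
    page, target = urlparse(page_url), urlparse(absolute)
    if page.netloc != target.netloc:
        return absolute
    # Directory that holds the page being rewritten.
    page_dir = posixpath.dirname(page.path) or "/"
    return posixpath.relpath(target.path or "/", start=page_dir)


link = to_relative_link(
    "http://example.com/docs/guide/index.html",
    "http://example.com/docs/img/logo.png",
)
```

Here link becomes ../img/logo.png, which still resolves correctly once both files are saved in the same folder layout on disk.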

5. Scraper

Scraper is a Chrome extension with limited data extraction features, but it’s helpful for doing online research and exporting data to Google Spreadsheets. This tool is intended for beginners as well as experts, who can easily copy data to the clipboard or store it in spreadsheets using OAuth. Scraper is a free web crawler tool that works right in your browser and auto-generates short XPaths for defining the URLs to crawl. It may not offer all-inclusive crawling services, but novices don’t need to tackle messy configurations either.
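For readers who have not met XPath before, the following Python sketch shows what an XPath-style selection looks like. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup. The sample table and the expressions are invented for illustration and are not output from Scraper.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a scraped page.
SAMPLE = """
<html><body>
  <table id="prices">
    <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
    <tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Select every "name" cell inside the table whose id is "prices".
names = [
    td.text
    for td in root.findall(".//table[@id='prices']/tr/td[@class='name']")
]

# A shorter expression that matches on the class attribute alone.
prices = [float(td.text) for td in root.findall(".//td[@class='price']")]
```

The second expression is the kind of shorter path a tool might generate: it ignores the page structure and keys only on the attribute that identifies the data.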

6. OutWit Hub

OutWit Hub is a Firefox add-on with dozens of data extraction features to simplify your web searches. This web crawler tool can browse through pages and store the extracted information in a proper format.

OutWit Hub offers a single interface for scraping small or large amounts of data as needed. It lets you scrape any web page from the browser itself and even create automatic agents to extract data and format it according to your settings.

It is one of the simplest web scraping tools, which is free to use and offers you the convenience to extract web data without writing a single line of code.

7. ParseHub

The ParseHub desktop application supports Windows, Mac OS X and Linux, or you can use the web app that is built into the browser.

With the free version, you can set up no more than five public projects in ParseHub. The paid subscription plans allow you to create at least 20 private projects for scraping websites.

8. Visual Scraper

VisualScraper is another great free, non-coding web scraper with a simple point-and-click interface that can be used to collect data from the web. You can get real-time data from several web pages and export the extracted data as CSV, XML, JSON or SQL files. Besides the SaaS, VisualScraper offers web scraping services such as data delivery and the creation of software extractors.

VisualScraper enables users to schedule their projects to run at a specific time or to repeat the sequence every minute, day, week, month or year. Users can use it to extract news, updates and forum posts frequently.
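Exporting scraped records to CSV or JSON is simple to reproduce if you ever need to post-process such data yourself. The Python sketch below assumes each record is a dictionary; the field names and sample rows are made up, and this is not VisualScraper's export code.

```python
import csv
import io
import json


def to_csv(rows, fieldnames):
    """Serialize a list of dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize the same records to indented JSON text."""
    return json.dumps(rows, indent=2)


# Hypothetical scraped records.
rows = [
    {"title": "First post", "url": "http://example.com/1"},
    {"title": "Second post", "url": "http://example.com/2"},
]

csv_text = to_csv(rows, ["title", "url"])
json_text = to_json(rows)
```

CSV is convenient for spreadsheets, while JSON keeps nested values intact if the records later grow more complex.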

9. Scrapinghub

Scrapinghub is a cloud-based data extraction tool that helps thousands of developers fetch valuable data. Its open source visual scraping tool allows users to scrape websites without any programming knowledge.

Scrapinghub uses Crawlera, a smart proxy rotator that supports bypassing bot counter-measures to crawl huge or bot-protected sites easily. It enables users to crawl from multiple IPs and locations without the pain of proxy management through a simple HTTP API.
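The basic idea behind a proxy rotator can be sketched as round-robin assignment. The Python code below is a simplified illustration and does not represent Crawlera's API or its logic. The class name, function name and proxy addresses are hypothetical.

```python
from itertools import cycle


class ProxyRotator:
    """Hand out proxies in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_proxy(self):
        return next(self._pool)


def assign_proxies(urls, rotator):
    """Pair each URL with the proxy that should be used to request it."""
    return [(url, rotator.next_proxy()) for url in urls]


# Placeholder proxy addresses, for illustration only.
rotator = ProxyRotator(["10.0.0.1:8080", "10.0.0.2:8080"])
plan = assign_proxies(["/page1", "/page2", "/page3"], rotator)
```

Successive requests leave from different addresses, so no single IP carries the whole crawl. A production rotator would also need to handle banned or slow proxies, which this sketch leaves out.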

Scrapinghub converts the entire web page into organized content. Its team of experts is available to help in case its crawl builder can’t meet your requirements.

10. Dexi.io

As a browser-based web crawler, Dexi.io allows you to scrape data from any website in your browser and provides three types of robots for creating a scraping task: Extractor, Crawler and Pipes. The free version provides anonymous web proxy servers for your web scraping. Your extracted data will be hosted on Dexi.io’s servers for two weeks before it is archived, or you can directly export the extracted data to JSON or CSV files. It also offers paid services for getting real-time data.

11. Webhose.io

Webhose.io enables users to get real-time data by crawling online sources from all over the world and delivering it in various clean formats. This web crawler enables you to crawl data and further extract keywords in many different languages, using multiple filters that cover a wide array of sources.

You can save the scraped data in XML, JSON and RSS formats, and access historical data from its Archive. Plus, Webhose.io supports up to 80 languages in its crawling results. Users can easily index and search the structured data crawled by Webhose.io.

On the whole, Webhose.io could satisfy users’ elementary crawling requirements.

12. Import.io

Users are able to form their own datasets by simply importing the data from a particular web page and exporting the data to CSV.

You can easily scrape thousands of web pages in minutes without writing a single line of code and build 1000+ APIs based on your requirements. Public APIs provide powerful and flexible capabilities to control Import.io programmatically and gain automated access to the data. Import.io has made crawling easier by letting you integrate web data into your own app or web site with just a few clicks.

To better serve users' crawling requirements, it also offers a free app for Windows, Mac OS X and Linux to build data extractors and crawlers, download data and sync with the online account. Plus, users are able to schedule crawling tasks weekly, daily or hourly.
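What an hourly, daily or weekly schedule means in practice can be shown with a small Python sketch that computes upcoming run times. The function name next_runs and the frequency labels are assumptions for illustration. This is not Import.io's scheduler.

```python
from datetime import datetime, timedelta

# Hypothetical frequency labels mapped to their repeat interval.
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def next_runs(start, frequency, count):
    """Return the next `count` run times after `start` for a frequency."""
    step = INTERVALS[frequency]
    return [start + step * i for i in range(1, count + 1)]


upcoming = next_runs(datetime(2017, 10, 31, 9, 0), "daily", 2)
```

With a daily frequency starting on 31 October 2017 at 09:00, the next two runs fall on 1 and 2 November at the same time.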

13. 80legs

80legs is a powerful web crawling tool that can be configured based on customized requirements. It supports fetching huge amounts of data along with the option to download the extracted data instantly. 80legs provides high-performance web crawling that works rapidly and fetches the required data in mere seconds.

14. Spinn3r

Spinn3r allows you to fetch complete data from blogs, news and social media sites, and RSS and ATOM feeds. Spinn3r is distributed with a firehose API that manages 95% of the indexing work. It offers advanced spam protection, which removes spam and inappropriate language, thus improving data safety.

Spinn3r indexes content similar to Google and saves the extracted data in JSON files. The web scraper constantly scans the web and finds updates from multiple sources to get you real-time publications. Its admin console lets you control crawls and full-text search allows making complex queries on raw data.
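Feeds are one of the easiest sources to process yourself because RSS is plain XML. The Python sketch below pulls item titles and links from a small, made-up RSS 2.0 document and converts them to JSON. It only illustrates the format and does not use Spinn3r's API.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed used as sample input.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>
"""


def parse_rss_items(xml_text):
    """Return the title and link of every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]


items = parse_rss_items(SAMPLE_RSS)
items_json = json.dumps(items)
```

Atom feeds use XML namespaces and different element names, so they would need a slightly different lookup than the one shown here.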

15. Content Grabber

Content Grabber is web crawling software targeted at enterprises. It allows you to create stand-alone web crawling agents. It can extract content from almost any website and save it as structured data in a format of your choice, including Excel reports, XML, CSV and most databases.

It is more suitable for people with advanced programming skills, since it offers many powerful script editing and debugging interfaces for those who need them. Users can use C# or VB.NET to write and debug scripts that control the crawling process programmatically. For example, Content Grabber can integrate with Visual Studio 2013 for the most powerful script editing, debugging and unit testing, so that an advanced, customized crawler can be built around users’ particular needs.

16. Helium Scraper

Helium Scraper is visual web data crawling software that works pretty well when the association between elements is small. It requires no coding and no configuration, and users can access online templates for various crawling needs.

Basically, it could satisfy users’ crawling needs within an elementary level.

17. UiPath

UiPath is robotic process automation software for free web scraping. It automates web and desktop data crawling out of most third-party apps. You can install the robotic process automation software if you run Windows. UiPath is able to extract tabular and pattern-based data across multiple web pages.

UiPath provides built-in tools for further crawling. This method is very effective when dealing with complex UIs. The Screen Scraping Tool can handle individual text elements, groups of text and blocks of text, such as data extraction in table format.
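Extracting tabular data means turning table markup into rows and cells. The Python sketch below does this with the standard library's HTMLParser on a made-up table. It is a simplified illustration of the concept and is unrelated to UiPath's own tools.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_table(html):
    parser = TableExtractor()
    parser.feed(html)
    return parser.rows


# Hypothetical table used as sample input.
SAMPLE_TABLE = (
    "<table>"
    "<tr><th>Name</th><th>Price</th></tr>"
    "<tr><td>Widget</td><td>9.99</td></tr>"
    "</table>"
)

table = extract_table(SAMPLE_TABLE)
```

The result is a list of rows, each a list of cell strings, which can then be written to a spreadsheet or database.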

Plus, no programming is needed to create intelligent web agents, but the .NET hacker inside you will have complete control over the data.

18. Scrape.it

Scrape.it is node.js web scraping software for humans. It’s a cloud-based web data extraction tool. It’s designed for those with advanced programming skills, since it offers both public and private packages to discover, reuse, update, and share code with millions of developers worldwide. Its powerful integration will help you build a customized crawler based on your needs.

19. WebHarvy

WebHarvy is point-and-click web scraping software designed for non-programmers. WebHarvy can automatically scrape text, images, URLs and emails from websites, and save the scraped content in various formats. It also provides a built-in scheduler and proxy support, which enables anonymous crawling and prevents the web scraping software from being blocked by web servers. You have the option to access target websites via proxy servers or a VPN.

The current version of WebHarvy Web Scraper allows you to export the scraped data as an XML, CSV, JSON or TSV file. Users can also export the scraped data to an SQL database.

20. Connotate

Connotate is an automated web crawler designed for enterprise-scale web content extraction. Business users can create extraction agents in as little as a few minutes, without any programming, simply by pointing and clicking.

It is able to automatically extract over 95% of sites without programming, including complex JavaScript-based dynamic site technologies such as Ajax. Connotate also supports any language for data crawling from most sites.

Additionally, Connotate offers the ability to integrate webpage and database content, including content from SQL databases and MongoDB for database extraction.

To conclude, the crawlers mentioned above can satisfy the basic crawling needs of most users, but their functionality still varies considerably, since many of these tools provide more advanced built-in configuration options. Thus, be sure you fully understand what features a crawler provides before you subscribe to it.