  • News Pipeline


    Real Time News Scraping and Recommendation System

    • Implemented a data pipeline which monitors, scrapes, and dedupes the latest news (MongoDB, Redis, RabbitMQ).
    • Designed data monitors that obtain the latest news from well-known websites and recommend it to the web server.
    • Fetched useful data from the original news websites by building news scrapers.
    • Built dedupers that filter out duplicate news by using NLP (TF-IDF) to analyze the similarity of articles scraped from news websites.
    • Used TensorFlow machine learning to show news according to each user's interests.
      Built a single-page web app.

    Building Record

    POST Design

    React Frontend UI

    NodeJS Web Server

    RestFul API features (By Routing)

    Frontend and Backend Http Protocol(RESTful API)

    Backend - SOA (Service Oriented Architecture) Design

    Backend - MongoDB connection

    CloudAMQP: Message Queue

    Pylint (Python Coding Style Check)

    🔨 Refactor: Create an Operator to Receive all API Requests from the Backend Server

    • Refactor: Operations
    • [CloudAMQP_Client]
    • [Mongodb_Client]
    • [News_api_Client]
    • [News_recommendation_service_Client]

    News Data Pipeline

    Monitor -> Q(scrape) -> Fetcher -> Q(dedupe)

    News Monitor

    News Fetcher (Crawler)

    News Deduper

    Authentication UI

    Authentication Logic

    Frontend - src/Auth

    React Router - With Auth

    Backend auth

    🔨 Auth Refactor

    Web Server Feature - Pagination

    Web Server Feature - Preference Model

    Web Server Feature - Click Log Processor


    React FrontEnd Build Up

    Decouple into Components

    (Image: image/app_structure.png)

    Components:

    • Base : the whole React app (Navbar + App)
    • App : Image (title) + NewsPanel
    • NewsPanel : contains many NewsCards as a news list (while the user scrolls, the backend continuously sends new NewsCards)
    • NewsCard : a single news item added to the NewsPanel, with the news image, title, contents, description, tag, and a link to the news. (User clicks will be recorded in the future.)

    Create React App

    Create React App

    • Deals with webpack and provides a whole framework
    • Install CRA globally as a local development tool:
    sudo npm install -g create-react-app
    
    • Create a new React App
    create-react-app top-news
    
    • Test Connection
    cd top-news
    npm start
    

    Build up APP Component

    • public : images
    • src : Each Component has its own folder
    App / App.js
    

    App.js

    • There is only one div tag in the render function
    • Import React, './App.css' (the CSS file) and the logo
    • Use "className" instead of "class": in ES6, class is reserved for defining the App class
    import React from 'react';
    import './App.css';
    import logo from './logo.png';
    
    class App extends React.Component {
      render() {
        return(
          <div>
            <img className = 'logo' src = {logo} alt = 'logo'/>
            <div className = 'container'>
                 {/* TODO */}
            </div>
          </div>
        ); 
      }
    }
    
    export default App;
    
    • Why use "default App"?
    • Without it, when you want to import App from another file, you need to type:
    import { App } from './App.js';
    
    • With default, you can drop the {}
    • CSS setup
    .App {
      text-align: center;
    }
    
    .logo {
      display: block;
      margin-left: auto;
      margin-right: auto;
      padding-top: 30px;
      width: 20%;
    }
    

    Materialize CSS Design

    • Install on the client side:
    npm install materialize-css --save
    
    • Import
    import 'materialize-css/dist/css/materialize.min.css';
    

    index.js in Client Side

    • Build an index.js for starting the client side:
    touch src/index.js
    
    • index.js
    import React from 'react';
    import ReactDOM from 'react-dom';
    import App from './App/App';
    
    ReactDOM.render(
      <App />,
      document.getElementById('root')
    );
    
    • Where is root?
    • public -> index.html
    <div id="root"></div>
    

    Build up NewsPanel Component

    Save all NewsCard and connect with BackEnd

    • Create NewsPanel folder and NewsPanel.js
    mkdir src/NewsPanel
    code src/NewsPanel/NewsPanel.js
    
    import React from 'react';
    import './NewsPanel.css';
    
    • Since we need to save the news content, we need internal state (hence a constructor)
    class NewsPanel extends React.Component {
      constructor() {
        super();
        this.state = { news: null };
      }
    
    • state = {news: null} -> a list of JSON objects
    • Render conditions: if there is news, create the NewsCards; otherwise show a loading message
    render() {
        if (this.state.news) {
          return (
            <div>
              {this.renderNews()}
            </div>
          );
        } else {
          return (
            <div>
              Loading ...
            </div>
          );
        }
      }
    
    • Local function renderNews(): renders the news and dynamically handles the NewsCards.
    • Clickable - use an <a> tag in HTML
    • Key - in React, if you use a list, you need to give each item a 'key', since the virtual DOM needs to know which items in the list changed so it can update just those items instead of re-rendering all of them.
    • "list-group-item" needs to be put inside "list-group", which shows the {news_list}
    • Get all news from state -> map each news item -> wrap each in a NewsCard -> put the NewsCards into the list-group
    renderNews() {
        const news_list = this.state.news.map(news => {
          return (
            <a className = 'list-group-item' key = {news.digest} href = '#'>
              <NewsCard news = {news} />
            </a>  
          );
        });
    
        return (
          <div className = 'container-fluid'>
            <div className = "list-group">
              {news_list}
            </div>  
          </div>
        );
      }
    
    
    • Local function loadMoreNews(): gets news from the backend on initial load. (For now we use mock data.)
    loadMoreNews() {
      this.setState({
        news : [
          {....data
          }]
      });
    }
    
    • After render() has run, React executes componentDidMount() -> load the news into state
      componentDidMount () {
        this.loadMoreNews();
      }
    
    • Import NewsCard
    import NewsCard from '../NewsCard/NewsCard';
    
    • Export NewPanel
    export default NewsPanel;
    

    Add NewsPanel CSS

    • Empty by default, for future use:
    touch src/NewsPanel/NewsPanel.css
    

    Import NewsPanel into App.js

    • App.js
    import NewsPanel from '../NewsPanel/NewsPanel';
    
    <div className = 'container'>
        <NewsPanel />
    </div>
    

    Build up NewsCard Component

    • Create NewsCard Component Folder
    mkdir src/NewsCard
    touch src/NewsCard/NewsCard.js
    src/NewsCard/NewsCard.css
    
    • class NewsCard (For HTML contents)
    class NewsCard extends React.Component {
      render() {
        return(
      HTML....
    

    HTML Structure

    • news-container
    • row
    • col s4 fill
    • image
    • col s8
    • news-intro-col
    • news-intro-panel
    • news-description
    • news-chip
    • onClick -> redirectToUrl()
      redirectToUrl(url, event) {
        event.preventDefault();
        window.open(url, '_blank');
      }
    
    • Get the data from props.news from NewsPanel.js
      <h4>
         {this.props.news.title}
      </h4>
    
    • NewsCard could get the data from NewsPanel since it was passed from :
      <a className = 'list-group-item' key = {news.digest} href = '#'>
        <NewsCard news = {news} />
      </a>  
    
    • Don't render the chip if there is no source (this.props.news.source != null &&)
     {this.props.news.source != null && <div className='chip light-blue news-chip'>{this.props.news.source}</div>}
    
    • CSS file
    .news-intro-col {
      display: inline-flex;
      color: black;
      height: 100%;
    }
    
    CSS....
    

    Refactor those Components into Web Server file

    • Create a web_server folder and move top-news into it, renamed "client"
    mkdir web_server
    mv top-news/ ./web_server/client
    

    Continuous loading News (Server-Side REST API - NodeJS & Client-Side - React)

    • When deployed (e.g. to AWS), there is no server/client split; everything is served together.

    • Create React App provides a "development server" for developing, but we won't use it to serve users.

    • Development: Node server + development server

    • Production: Node server + build (built by React App)

    Express application generator - NodeJS Server

    Express Application Generator

    • Install Globally
    sudo npm install express-generator -g
    
    • Create a Server in web_server
    express server  //   Usage: express [options] [dir]
    
    • Install dependencies
    cd server
    npm install
    npm start
    

    Configure App.js

    (by default it installs lots of dependencies)

    • Delete :
    • bodyParser: POST Request
    • cookieParser: Authentication
    • logger: Login
    • users: Login
    • Change views engine
    • Put the default folder to /client/build
    app.set('views', path.join(__dirname, '../client/build'));
    
    • Express static: serves the images. (Found a bug here: the '/static' prefix was missing.)
    app.use('/static', 
        express.static(path.join(__dirname, '../client/build/static')));
    
    • Client Webpack: Build a build folder for server to use
    
    
    • static - css
    • static - js
    • Error Handler
    app.use(function(req, res, next) {
      res.status(404);
    });
    


    • package.json : change start
      "scripts": {
        "start": "nodemon ./bin/www"
      },
    

    Server Side Routing

    index.js serves index.html from the build folder

    • Since the app initially runs at '/', route it to routes/index.js:
    app.use('/', index);
    
    • index.js: sends index.html from the build folder to the client
    • Get the home page!
    var express = require('express');
    var router = express.Router();
    var path = require('path');
    
    router.get('/', function(req, res, next) {
      res.sendFile("index.html", 
      { root: path.join(__dirname, '../../client/build')});
    });
    
    module.exports = router;
    
    
    • bin -> www : where the app is initialized.

    RESTful API: Send Backend data from Server

    News Routes

    • In routes/news.js
    touch server/routes/news.js
    
    • Put mock data here and send it as JSON:
    var express = require('express');
    var router = express.Router();
    
    router.get('/', function(req, res, next) {
      news = [
        .....DATA
      ];
      res.json(news);
    });
    
    module.exports = router;
    
    • In app.js require the news Route
    var news = require('./routes/news');
    app.use('/news', news);
    

    NewsPanel Requests to Backend for Loading More JSON data

    • NewsPanel.js -> loadMoreNews() against the backend
    • Cache: disable it, otherwise it might show old news from the cache
    • news_url -> window.location.hostname
    • 'http://' + window.location.hostname + ':3000' + '/news'
    • method: GET
    const news_url = 'http://' + window.location.hostname + ':3000' + '/news';
    
    const request = new Request(news_url, {method:'GET', cache:'no-cache'});
    
    • Fetch + .then: HTTP request & Promise
    • res.json() is asynchronous, so we need another ".then" to consume the JSON
    • After we get the JSON, handle the news data
    • If there is no news on the page yet, set it directly; otherwise "concat" the new news onto the old
    fetch(request)
        .then(res => res.json())
        .then(news => {
          this.setState({
              news: this.state.news ? this.state.news.concat(news) : news,
          });
      });
    

    Access Control Allow Origin

    Both the client and server run on localhost during development

    • We can't make cross-origin requests between localhost:3000 and localhost:3001, since they run on different ports.

    • Temporarily allow access across the different ports.

    • (BUT THIS NEEDS TO BE REMOVED FOR THE FINAL RELEASE)
    • app.js
    app.all('*', function(req, res, next) {
      res.header("Access-Control-Allow-Origin", "*");
      res.header("Access-Control-Allow-Headers", "X-Requested-With");
      next();
    });
    
    Failed to load http://localhost:3000/news: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://localhost:3001' is therefore not allowed access.
    

    Handle Scrolling

    • Keep calling loadMoreNews by adding a scroll EventListener:
    -             |          |  -
    document.     |          |  |
    body.         |          | scrollY
    offsetHeight  |          |  |
    |             |__________|  -
    |             |          |  -
    |             |          | window.innerHeight
    -             |__________|  -
    
    • window.innerHeight + scrollY >= document.body.offsetHeight - 50 means the user has scrolled to within 50px of the bottom -> load more news
    • You can't call "this.loadMoreNews()" inside handleScroll unless this is bound, e.g. by registering the handler as an arrow function
     window.addEventListener('scroll', () => this.handleScroll);
    
    • handleScroll()
      handleScroll() {
        const scrollY = window.scrollY
          || window.pageYOffset
          || document.documentElement.scrollTop;
        if ((window.innerHeight + scrollY) >= (document.body.offsetHeight - 50)) {
          this.loadMoreNews();
        }
      }
    
    
    • DON'T FORGET THE () -> this.handleScroll()
      componentDidMount() {
        this.loadMoreNews();
        window.addEventListener('scroll', () => this.handleScroll());
      }
    

    Debounce

    Lodash

    • Install Lodash in client:
    npm install lodash --save
    
    • Solves the problem of scroll events firing too often
    • which would otherwise send requests to the backend too frequently
    import _ from 'lodash';
    
      componentDidMount() {
        this.loadMoreNews();
        this.loadMoreNews = _.debounce(this.loadMoreNews, 1000);
        window.addEventListener('scroll', () => this.handleScroll());
      }
    

    SOA (Service Oriented Architecture)

    SOA Design Pattern

    • All service interfaces should be designed for both internal and external users
    Benefit:
    Isolation - language / technology / tools /
            decoupling / independence / deployment / maintenance
    Ownership - minimal gray area and gap
    Scalability - easy to scale up and modify
    =======
    Con:
    Complexity - sometimes unnecessary
    Latency - network communication eats time
    Test effort - all services require E2E tests
    DevOps - on-call!!!
    

    Normal Application Design Logic

    • Often built as a three-tier architecture:
       [Desktop User]
            |
     [Presentation Tier] : client interaction via a web browser
            |
        [Logic Tier] : provides the application's
            |          functionality via detailed processing
       [Storage Tier] : handles persisting and retrieving application data
    
    

    Unfortunately things get more complicated: conflict!!!

    • Other types of users
    • Attachments
    • Bulk operations
    • Data pipelines
    • Notifications
    • Monitoring
    • Testing
    Mobile          Desktop     UI
    User            User       Test
                \     |       /
    Chrome
    Extension  -  Presentation -    Prober
                     Tier
    File                          File
    Upload      \      |     /   Download
                     Logic
    Notifica-   -    Tier    -  Command
    tions                       Line Tool
              /       |       \
    CSV             Storage        CSV
    Upload            Tier        Download
              /               \
    Data                           Data
    Provider                      Consumer
    

    With SOA:

    • Front-end service handles all external interactions
    • Back-end implements one protocol to talk to front-end
    • All clients see same business abstraction
    • Consistent business logic enforcement
    • Easy internal refactoring

    (Image: image/SOA_structure.png)


    RPC Backend Service

    
    || Client ||  || Node Server ||  || Backend Server ||  || Redis ||  || MongoDB ||  || ML Server ||
        |                 |                  | Check if in Redis   |            |             |
        |---------------->|                  |<------------------->|            |             |
        | fetch more news |----------------->| (If not) get news from DB        |             |
        | (userId/pageNum)| getNewsSummaries |<--------------------------------->|            |
        |                 | ForUser          | Get recommended news from ML server            |
        |<----------------| (userId/pageNum) |<------------------------------------------------>|
        |  Sliced news    |                  | Store combined news in Redis     |             |
        |                 |<-----------------|-------------------->|            |             |
        |                 |  Sliced news     |                     |            |             |
    || Client ||  || Node Server ||  || Backend Server ||  || Redis ||  || MongoDB ||  || ML Server ||
    
    
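    Put differently, the diagram is a cache-aside lookup with pagination. Here is a minimal Python sketch of that backend operation, assuming a local Redis and a pymongo database handle like the mongodb_client built later in this doc; the function name, page size, and helper wiring are illustrative, not fixed by the original:

    import json

    import redis

    REDIS_HOST = 'localhost'   # assumed dev config
    REDIS_PORT = 6379
    NEWS_LIST_BATCH_SIZE = 10  # illustrative page size

    redis_client = redis.StrictRedis(REDIS_HOST, REDIS_PORT)

    def getNewsSummariesForUser(user_id, page_num, db):
        """db is a pymongo database handle, e.g. mongodb_client.get_db()."""
        begin = (int(page_num) - 1) * NEWS_LIST_BATCH_SIZE
        end = begin + NEWS_LIST_BATCH_SIZE

        # 1. Cache hit: Redis already holds this user's news digests.
        cached = redis_client.get(user_id)
        if cached is not None:
            digests = json.loads(cached.decode('utf-8'))
            return list(db['news'].find({'digest': {'$in': digests[begin:end]}}))

        # 2. Cache miss: load recent news from MongoDB (the real system would
        #    also merge in recommendations from the ML server), cache the
        #    digests in Redis, and return one slice.
        total_news = list(db['news'].find().sort([('publishedAt', -1)]).limit(100))
        redis_client.set(user_id, json.dumps([n['digest'] for n in total_news]))
        return total_news[begin:end]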

    BackEnd Server

    • Create a backend_server folder and a service.py:
    mkdir backend_server
    touch backend_server/service.py
    

    JSONRPClib Libraries

    • Builds a client or server to send or receive JSON-RPC requests

    • jsonrpclib doesn't support Python 3.5 well, so we add jsonrpclib-pelix to help development
      JSONRPClib
      JSONRPClib-pelix

    • install library

    pip3 install jsonrpclib
    pip3 install jsonrpclib-pelix
    
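    Besides Postman (below), you can also sanity-check the server from Python, since jsonrpclib ships a client-side proxy. A minimal sketch, assuming the RPC server from the next section is already listening on localhost:4040:

    from jsonrpclib import Server

    # Proxy object: attribute calls become JSON-RPC requests over HTTP
    rpc = Server('http://localhost:4040')
    print(rpc.add(1, 2))  # expects 3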

    RPC Server - Testing

    • Define the server host
    • Define the server port
    • (We define these as constants so a future change only touches one line)
    • Give it a function - add
    • Register the host, port, and your functions
    from jsonrpclib.SimpleJSONRPCServer import SimpleJSONRPCServer

    SERVER_HOST = 'localhost'
    SERVER_PORT = 4040

    def add(a, b):
      print("Add is called with %d and %d " %(a, b))
      return a + b

    RPC_SERVER = SimpleJSONRPCServer((SERVER_HOST, SERVER_PORT))
    RPC_SERVER.register_function(add, 'add')

    print("Starting RPC server")

    RPC_SERVER.serve_forever()
    

    Testing by Postman

    • Send a REQUEST:
    • jsonrpc version
    • id: to identify the call
    • method: add
    • params: give a & b
    POST Request:
    {
    	"jsonrpc" : "2.0",
    	"id" : 1,
    	"method" : "add",
    	"params" : [1,2]
    }
    
    Result:
    {
        "result": 3,
        "id": 1,
        "jsonrpc": "2.0"
    }
    
    Add is called with 13 and 2
    127.0.0.1 - - [13/Jan/2018 14:48:25] "POST / HTTP/1.1" 200 -
    
    

    NodeJS Server as an RPC Client - jayson

    • Open a new folder in web_server/server/
    mkdir web_server/server/rpc_client
    
    • Change news.js so it doesn't hard-code the data but gets news from our backend server:
    var express = require('express');
    var router = express.Router();
    
    
    /* GET News List. */
    router.get('/', function(req, res, next) {
      news = backend_server.getNews();
      res.json(news);
    });
    
    module.exports = router;
    
    

    Make NodeJS a Client - npm jayson

    jayson

    • install jayson in server
    npm install jayson --save
    
    • Open rpc_client.js with a helper method so news.js can getNews() from our backend server:
    var jayson = require('jayson');
     
    // create a client
    var client = jayson.client.http({
      hostname: 'localhost',
      port: 4040
    });
     
    function add(a, b, callback) {
      client.request('add', [a, b], function(err, response) {
        if(err) throw err;
        console.log(response.result);
        callback(response.result);
      });
    }
    
    module.exports = {
      add : add
    }
    

    Write a test file

    • Open rpc_client_test.js:
    touch rpc_client/rpc_client_test.js
    
    • Import rpc_client:
    var client = require('./rpc_client');
    
    // invoke 'add'
    
    client.add(1, 2, function(res){
      console.assert(res == 3);
    });
    
    • How to test?
    • Open the backend server
    • Execute the rpc_client_test.js
    node rpc_client_test.js
    

    MongoDB

    • Install MongoDB (since we need to train on the data, we store it locally)
    sudo apt-get install -y mongodb-org
    
    ./mongod
    
    • Run Mongo shell
    ./mongo
    

    Mongo Syntax

    Test MongoDB (Crawling in the future)

    • show DB
    show dbs
    
    • Switch DB
    use top-news
    
    • See Collections / Tables
    show collections
    show tables
    
    • [Query Documents](https://docs.mongodb.com/manual/tutorial/query-documents/)
    db.news.find()
    db.news.findOne()
    db.news.count()
    

    Export - Mongoexport

    • Export db
    ./mongoexport --db top-news --collection news --out demo_news_1.json
    

    Import Data from JSON file

    • Import db
    mongoimport --db top-news --collection news --file demo_news_1.json
    

    Backend Connect to MongoDB - pymongo

    pymongo

    • Install pymongo
    pip3 install pymongo
    
    • Set up a list of all dependencies (like package.json in npm)

    Requirements.txt

    https://api.mongodb.com/python/current/
    
    pip3 install -r requirements.txt
    
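    For reference, a minimal requirements.txt sketch covering what has been installed so far; the exact contents are illustrative, not from the original:

    # requirements.txt
    jsonrpclib
    jsonrpclib-pelix
    pymongo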

    Set up a MongoDB Client

    • Create a utils folder in backend_server:
    mkdir utils
    touch utils/mongodb_client.py
    
    • Making a Connection with MongoClient and Getting a Database
    from pymongo import MongoClient
    
    MONGO_DB_HOST = "localhost"
    MONGO_DB_PORT = 27017
    DB_NAME = "test"
    
    client = MongoClient(MONGO_DB_HOST, MONGO_DB_PORT)
    
    def get_db(db = DB_NAME):
      db = client[db]
      return db
    

    MongoDB Test

    • Connect with MongoClient to do CRUD
    • Open a test file:
    touch utils/mongodb_client_test.py
    
    • Runs only when the user calls test_basic()
    • db = client.get_db('test'): note that get_db is spelled with an underscore
    import mongodb_client as client
    
    def test_basic():
      db = client.get_db('test')
      db.test.drop()
      assert db.test.count()  == 0
    
      db.test.insert({'test' : 1})
      assert db.test.count() == 1
    
      db.test.drop()
      assert db.test.count() == 0
    
      print('test_basic passed!')
    
    if __name__ == "__main__":
      test_basic()
    

    CloudAMQP

    RabbitMQ

    RabbitMQ is a message broker: it accepts and forwards messages. You can think about it as a post office: when you put the mail that you want posting in a post box, you can be sure that Mr. Postman will eventually deliver the mail to your recipient. In this analogy, RabbitMQ is a post box, a post office and a postman.
    
    The major difference between RabbitMQ and the post office is that it doesn't deal with paper, instead it accepts, stores and forwards binary blobs of data ‒ messages.
    

    CloudAMQP && Pika

    • AMQP URL is the address to receive and send the messages

    • Pika manipulates AMQP:
      pika

    [RabbitMQ Pika]

    • Install Pika by adding it to requirements.txt:
    pika
    
    • Make a file for CloudAMQP client
    touch backend_server/utils/cloudAMQP_client.py
    

    How CloudAMQP works

    CloudAMQP with Python

    • String -> JSON -> serialization
    • The queue name is per instance, so we create a class
    • Parse the URL parameters
    • Set a socket timeout
    • Connect via Pika (BlockingConnection)
    • Open a channel for receiving messages
    • Declare the queue on the channel by queue name
    import json
    import pika

    class CloudAMQPClient:
      def __init__(self, cloud_amqp_url, queue_name):
        self.cloud_amqp_url = cloud_amqp_url
        self.queue_name = queue_name
        self.params = pika.URLParameters(cloud_amqp_url)
        self.params.socket_timeout = 3
        self.connection = pika.BlockingConnection(self.params)
        self.channel = self.connection.channel()
        self.channel.queue_declare(queue = queue_name)
    
    • Decode: the body comes back as bytes, so decode it before json.loads:
    return json.loads(body.decode('utf-8'))
    

    Methods for Sending and Getting Messages

    sendMessage: serialize the message body into a JSON string

    def sendMessage(self, message):
      self.channel.basic_publish(exchange='',
                                 routing_key = self.queue_name,
                                 body = json.dumps(message))
      print("[X] Sent message to %s:%s" %(self.queue_name, message))
    

    GetMessage: by “basic_get”

    Get a single message from the AMQP broker. Returns a sequence with the method frame, message properties, and body.
    
    Returns:	
    a three-tuple; (None, None, None) if the queue was empty; otherwise (method, properties, body); NOTE: body may be None
    
    • Ack with the delivery tag each time the broker returns a message, and convert the body back from string to JSON
      # Get a message
      def getMessage(self):
        method_frame, header_frame, body = self.channel.basic_get(self.queue_name)
        if method_frame:
          print("[x] Received message from %s:%s" % (self.queue_name, body))
          self.channel.basic_ack(method_frame.delivery_tag)
          return json.loads(body)
        else:
          print("No message returned.")
          return None
    

    Heart Beat

    • BlockingConnection.sleep is a safer way to sleep than time.sleep().
    • This will respond to the server's heartbeat.
    def sleep(self, seconds):
        self.connection.sleep(seconds)
    

    cloudAMQP test

    • Open a test file
    touch utils/cloudAMQP_client_test.py
    
    • Import the CloudAMQPClient class and exercise it with test_basic():
    from cloudAMQP_client import CloudAMQPClient
    
    CloudAMQP_URL = "amqp://xggyaoov:dudqi2kLBrreuJ-tST0uhiUcD3-rWomQ@termite.rmq.cloudamqp.com/xggyaoov"
    TEST_QUEUE_NAME = "test"
    
    def test_basic():
      client = CloudAMQPClient(CloudAMQP_URL, TEST_QUEUE_NAME)

      sent_msg = {"test": "test"}
      client.sendMessage(sent_msg)

      received_msg = client.getMessage()

      assert sent_msg == received_msg
      print("test_basic passed!")
    
    if __name__ == "__main__":
      test_basic()
    

    Backend API send Request to CloudAMQPClient API for Asking News in Queue

    • Test getting one news item in service.py
    • Import json and bson's dumps to convert from BSON to JSON
    • Register the function in the RPC server
    This module provides two helper methods dumps and loads that wrap the native json methods and provide explicit BSON conversion to and from JSON. JSONOptions provides a way to control how JSON is emitted and parsed, with the default being the legacy PyMongo format. json_util can also generate Canonical or Relaxed Extended JSON when CANONICAL_JSON_OPTIONS or RELAXED_JSON_OPTIONS is provided, respectively.
    
    import json
    from bson.json_util import dumps
    
    def get_one_news():
      print("get_one_news is called.")
      news = mongodb_client.get_db()['news'].find_one()
      return json.loads(dumps(news))
    
    RPC_SERVER.register_function(get_one_news, 'get_one_news')
    
    
    
    
    • Add 'utils' to the Python path (via os and sys) so we can import mongodb_client:
    import os
    import sys
    
    # import utils packages
    sys.path.append(os.path.join(os.path.dirname(__file__), 'utils'))
    import mongodb_client
    

    Pylint

    PEP 8 - Style Guide

    • Install PyLint
    pip3 install pylint
    
    • Analyze Outcomes
    C:  1, 0: Missing module docstring (missing-docstring)
    E: 10, 0: Unable to import 'mongodb_client' (import-error)
    C: 10, 0: Import "import mongodb_client" should be placed at the top of the module (wrong-import-position)
    C: 16, 0: Invalid argument name "a" (invalid-name)
    C: 16, 0: Invalid argument name "b" (invalid-name)
    C: 16, 0: Missing function docstring (missing-docstring)
    C: 20, 0: Missing function docstring (missing-docstring)
    

    How to solve

    • missing-docstring: add docstrings -> """XXX"""
    • import-error: in our case, write an exception to suppress the error
    • wrong-import-position: imports need to go at the top of the module
    • invalid-name: don't use an argument named a; rename it, e.g. num1
    • bad-whitespace: write 1, 2 with a space after the comma

    (a before/after sketch follows)
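    A sketch of the add() function from the RPC server, rewritten to satisfy those warnings; the docstrings and names are illustrative:

    """RPC service entry point."""

    def add(num1, num2):
        """Add two numbers and return the sum."""
        print("Add is called with %d and %d" % (num1, num2))
        return num1 + num2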

    Refactor: Operations

    • getOneNews should move into another file so that service.py stays a thin surface that just receives API requests
    • Create a file operations.py:
    import os
    import sys
    import json
    from bson.json_util import dumps
    
    # import utils packages
    sys.path.append(os.path.join(os.path.dirname(__file__), 'utils'))
    
    import mongodb_client
    
    NEWS_TABLE_NAME = "news"
    
    def getOneNews():
        db = mongodb_client.get_db()
        news = db[NEWS_TABLE_NAME].find_one()
        return json.loads(dumps(news))
    
    • Import operations in service.py:
    import operations
    

    Refactor : Let Utils be used by Both Backend Server and Data pipeline

    mkdir common
    mv backend_server/utils/* common/
    mv backend_server/requirements.txt ./
    rmdir backend_server/utils
    
    • Change the path in service.py and operations.py from utils to common:
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'common'))
    

    News Pipeline

    • News API
    • News Monitor (thread): uses the News API to get the latest news URLs, runs every 10 sec
    • Redis (saves collected news): solves the duplication problem; if an item has already been collected, we ignore it
    • RabbitMQ: receives the accepted news URLs from the News Monitor and sends them to the web scrapers
    • Web Scrapers: receive a news URL and scrape the content (XPath) from the website
    • RabbitMQ: receives the news contents from the web scrapers
    • News Deduper: receives the scraped news from RabbitMQ and filters out news with the same content using NLP (TF-IDF)

    Steps

      1. News Monitor
      2. News Fetcher - XPath
      3. News Deduper
      4. News Fetcher - third-party package (replaces XPath)

    News API

    News API

    Request

    https://newsapi.org/v2/top-headlines?country=us&category=business&apiKey=715e9632a2a94ea1a4546e3f314a76a5
    
    • source:
    • apiKey:
    "status": "ok",
    "totalResults": 20,
    "articles": [ ...
    
    touch common/news_api_client.py
    
    • Install requests and add it to requirements.txt:
    pip3 install requests
    
    • getNewsFromSource()
    • private _buildUrl
    • DEFAULT_SOURCES / SORT_BY_TOP
    • The response is a string; decode it as UTF-8 and parse it into JSON
    import requests
    from json import loads

    CNN = 'cnn'
    DEFAULT_SOURCES = [CNN]
    SORT_BY_TOP = 'top'
    NEWS_API_KEY = '715e9632a2a94ea1a4546e3f314a76a5'
    NEWS_API_ENDPOINT = "https://newsapi.org/v1/"
    ARTICLES_API = "articles"

    def _buildUrl(endPoint = NEWS_API_ENDPOINT, apiName = ARTICLES_API):
        return endPoint + apiName

    def getNewsFromSource(sources = DEFAULT_SOURCES, sortBy = SORT_BY_TOP):
        articles = []

        for source in sources:
            payload = {'apiKey' : NEWS_API_KEY,
                       'source' : source,
                       'sortBy' : sortBy}
            response = requests.get(_buildUrl(), params = payload)
            res_json = loads(response.content.decode('utf-8'))
    
    • Check that the response is valid
    • status -> ok; source and res_json not None
    • Populate the news source into each article: add the source into the result
    .....'publishedAt': '2018-01-14T10:36:26Z', 'source': 'cnn'}]
    
     # Extract news from response
            if (res_json is not None and
                res_json['status'] == 'ok' and
                res_json['source'] is not None):
                # populate news source in each articles.
                for news in res_json['articles']:
                    news['source'] = res_json['source']
                articles.extend(res_json['articles'])
    
        return articles
    

    News Api test

    • test_basic()
    • uses getNewsFromSource, makes sure the number of news items is > 0, then tries other sources
    import news_api_client as client
    
    def test_basic():
        news = client.getNewsFromSource()
        print(news)
        assert len(news) > 0
        news = client.getNewsFromSource(sources=['cnn'], sortBy='top')
        assert len(news) > 0
        print('test_basic passed!')
    
    if __name__ == "__main__":
        test_basic()
    

    News Monitor

    • Connect with Redis
    • Connect with RabbitMQ
    • Connect with News API Client

    Send to Redis

    • Install Redis
    pip3 install redis
    
    • News Monitor

    • number_of_news records how many news items were fetched

    • Record: title / description / text / url / author / source / publishedAt:date / urlToImage / class / digest

    • What is the digest used for? To check for duplicates in Redis; hashing the title with hashlib saves space in Redis

    • The other fields we can use in the React frontend

    • Add the digest back into the news JSON

    """News Monitor"""
    import hashlib
    import redis
    import os
    import sys
    
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'common'))
    import news_api_client
    
    NEWS_SOURCES = ['cnn']

    while True:
        news_list = news_api_client.getNewsFromSource(NEWS_SOURCES)

        number_of_news = 0

        for news in news_list:
            news_digest = hashlib.md5(news['title'].encode('utf-8')).hexdigest()
    
    • Connect to Redis and use it to check whether the digest is already stored:
    REDIS_HOST = 'localhost'
    REDIS_PORT = 6379
    redis_client = redis.StrictRedis(REDIS_HOST, REDIS_PORT)
    
      if redis_client.get(news_digest) is None:
          number_of_news += 1
    
    • Deal with the publishedAt problem: some news items come without publishedAt, but we need it to sort the news. So we use the datetime at which we received the item as its publishedAt time:
    "publishedAt": "2018-01-14T20:17:50Z"
    
    import datetime

    for news in news_list:
        news_digest = hashlib.md5(news['title'].encode('utf-8')).hexdigest()
        # Connect with Redis and check whether the digest is already there
        if redis_client.get(news_digest) is None:
            number_of_news += 1
            # Deal with the publishedAt problem
            if news['publishedAt'] is None:
                news['publishedAt'] = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
            # Save into Redis (NEWS_TIME_OUT_IN_SECONDS is defined elsewhere)
            redis_client.set(news_digest, "True")
            redis_client.expire(news_digest, NEWS_TIME_OUT_IN_SECONDS)
    

    Send to RabbitMQ

    • Init and import the CloudAMQP client
    • Need to apply for another queue, different from the test URL

    from cloudAMQP_client import CloudAMQPClient

    SCRAPE_NEWS_TASK_QUEUE_URL = ""  # TODO: use your own config.
    SCRAPE_NEWS_TASK_QUEUE_NAME = "top-news-scrape-news-task-queue"
    
    SLEEP_TIME_IN_SECOND = 10
    cloudAMQP_client = CloudAMQPClient(SCRAPE_NEWS_TASK_QUEUE_URL, SCRAPE_NEWS_TASK_QUEUE_NAME)
    
                # Send Tasks to cloudAMQP
                cloudAMQP_client.sendMessage(news)
    
        print("Fetched %d news." % number_of_news)
    
        cloudAMQP_client.sleep(SLEEP_TIME_IN_SECOND)
    
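    For reference, the monitor fragments above assembled into a single runnable sketch; the AMQP URL is left as a placeholder, and NEWS_TIME_OUT_IN_SECONDS is an assumed constant for how long digests live in Redis:

    """News Monitor: poll the News API, dedupe by digest in Redis, enqueue scrape tasks."""
    import datetime
    import hashlib
    import os
    import sys

    import redis

    sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'common'))
    import news_api_client
    from cloudAMQP_client import CloudAMQPClient

    REDIS_HOST = 'localhost'
    REDIS_PORT = 6379

    NEWS_SOURCES = ['cnn']
    NEWS_TIME_OUT_IN_SECONDS = 3600 * 24 * 3  # assumed: keep digests for 3 days
    SLEEP_TIME_IN_SECONDS = 10

    SCRAPE_NEWS_TASK_QUEUE_URL = ""  # TODO: use your own config.
    SCRAPE_NEWS_TASK_QUEUE_NAME = "top-news-scrape-news-task-queue"

    redis_client = redis.StrictRedis(REDIS_HOST, REDIS_PORT)
    cloudAMQP_client = CloudAMQPClient(SCRAPE_NEWS_TASK_QUEUE_URL, SCRAPE_NEWS_TASK_QUEUE_NAME)

    while True:
        news_list = news_api_client.getNewsFromSource(NEWS_SOURCES)
        number_of_news = 0

        for news in news_list:
            news_digest = hashlib.md5(news['title'].encode('utf-8')).hexdigest()
            if redis_client.get(news_digest) is None:
                number_of_news += 1
                if news.get('publishedAt') is None:
                    news['publishedAt'] = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
                news['digest'] = news_digest
                redis_client.set(news_digest, "True")
                redis_client.expire(news_digest, NEWS_TIME_OUT_IN_SECONDS)
                cloudAMQP_client.sendMessage(news)

        print("Fetched %d news." % number_of_news)
        cloudAMQP_client.sleep(SLEEP_TIME_IN_SECONDS)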

    Stuck on a CloudAMQP Problem! (solved - server issue)

    pika.exceptions.ProbableAuthenticationError: (403, 'ACCESS_REFUSED - Login was refused using authentication mechanism PLAIN. For details see the broker logfile.')
    

    Tool for Cleaning the Queue

    import os
    import sys
    
    # import common package in parent directory
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'common'))
    
    from cloudAMQP_client import CloudAMQPClient
    
    SCRAPE_NEWS_TASK_QUEUE_URL = "amqp://cbzkwlek:4louH2OEYrE66kGmwv8RmLiOC2JZyhSi@donkey.rmq.cloudamqp.com/cbzkwlek"
    SCRAPE_NEWS_TASK_QUEUE_NAME = "top-news-SCRAPE_NEWS_TASK_QUEUE"
    
    # DEDUPE_NEWS_TASK_QUEUE_URL = #TODO: use your own config.
    # DEDUPE_NEWS_TASK_QUEUE_NAME = #TODO: use your own config.
    
    def clearQueue(queue_url, queue_name):
        scrape_news_queue_client = CloudAMQPClient(queue_url, queue_name)
    
        num_of_messages = 0
    
        while True:
            if scrape_news_queue_client is not None:
                msg = scrape_news_queue_client.getMessage()
                if msg is None:
                    print("Cleared %d messages." % num_of_messages)
                    return
                num_of_messages += 1
    
    
    if __name__ == "__main__":
        clearQueue(SCRAPE_NEWS_TASK_QUEUE_URL, SCRAPE_NEWS_TASK_QUEUE_NAME)
        # clearQueue(DEDUPE_NEWS_TASK_QUEUE_URL, DEDUPE_NEWS_TASK_QUEUE_NAME)
    

    Web Scrapers

    XPath

    • XPath Helper
    """//p[contains(@class, 'zn-body__paragraph')]//text() | //div[contains(@class, 'zn-body__paragraph')]//text()"""
    

    Scrapers

    • Create a scrapers folder and a CNN news scraper file
    • Imitate the behavior of a browser
    • session & headers
    • Imitate a real user agent in the header
    def extract_news(news_url):
        session_requests = requests.session()
        response = session_requests.get(news_url, headers=_get_headers())
        news = {}
    
    • Get a header by picking randomly from the mock user-agent file:
    def _get_headers():
        ua = random.choice(USER_AGENTS)
        headers = {
          "Connection" : "close", 
          "User-Agent" : ua
        }
        return headers
    
    • Import html from lxml
    • Apply the XPath: parse the response into a tree, then extract the news text nodes from the tree
    • Join the LIST of text nodes into one whole STRING
    from lxml import html
    try:
            tree = html.fromstring(response.content)
            news = tree.xpath(GET_CNN_NEWS_XPATH)
            news = ''.join(news)
        except Exception:
            return {}
    
        return news
    
    
    • Grab the agent info from the file and randomly select one of them:
    # Load user agents
    USER_AGENTS_FILE = os.path.join(os.path.dirname(__file__), 'user_agents.txt')
    USER_AGENTS = []
    
    with open(USER_AGENTS_FILE, 'rb') as uaf:
        for ua in uaf.readlines():
            if ua:
                USER_AGENTS.append(ua.strip()[1:-1])
    
    random.shuffle(USER_AGENTS)
    
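    For reference, the scraper pieces assembled into one cnn_news_scraper.py sketch; it assumes user_agents.txt holds one quoted user-agent string per line, which is what the [1:-1] slice implies:

    import os
    import random

    import requests
    from lxml import html

    GET_CNN_NEWS_XPATH = (
        "//p[contains(@class, 'zn-body__paragraph')]//text()"
        " | //div[contains(@class, 'zn-body__paragraph')]//text()"
    )

    # Load the mock user agents once at import time.
    USER_AGENTS_FILE = os.path.join(os.path.dirname(__file__), 'user_agents.txt')
    USER_AGENTS = []

    with open(USER_AGENTS_FILE, 'rb') as uaf:
        for ua in uaf.readlines():
            if ua:
                USER_AGENTS.append(ua.strip()[1:-1])

    random.shuffle(USER_AGENTS)

    def _get_headers():
        ua = random.choice(USER_AGENTS)
        return {"Connection": "close", "User-Agent": ua}

    def extract_news(news_url):
        # Imitate a browser session with a randomized user agent.
        session_requests = requests.session()
        response = session_requests.get(news_url, headers=_get_headers())
        try:
            tree = html.fromstring(response.content)
            news = tree.xpath(GET_CNN_NEWS_XPATH)
            news = ''.join(news)
        except Exception:
            return {}
        return news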

    News Fetcher

    • Take a news URL from the queue (filled by the news monitor), use the scraper to get the contents, and send them to the next queue
    • A while loop like the monitor's: get a message from scrape_news_queue_client and handle it
    while True:
        if scrape_news_queue_client is not None:
            msg = scrape_news_queue_client.getMessage()
            if msg is not None:
                # Parse and process the task
                try:
                    handle_message(msg)
                except Exception as e:
                    print(e)
                    pass
            scrape_news_queue_client.sleep(SLEEP_TIME_IN_SECONDS)
    
    • handleMessage()
    def handle_message(msg):
        if msg is None or not isinstance(msg, dict):
            print('message is broken')
            return
    
        task = msg
        text = None
    
    • Check whether the source is cnn; if so, extract_news() with task['url'] from the news scraper and write the text into the task
        if task['source'] == 'cnn':
            print('scraping CNN news')
            text = cnn_news_scraper.extract_news(task['url'])
        else:
            print('News source [%s] is not supported. ' % task['source'])

        task['text'] = text
    
    • Send out the task to dedupe Queue
        dedupe_news_queue_client.sendMessage(task)
    
    • Import os, sys and the CloudAMQP client
    import os
    import sys
    
    
    # import common package in parent directory
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'common'))
    
    from cloudAMQP_client import CloudAMQPClient
    
    DEDUPE_NEWS_TASK_QUEUE_URL = ""  # TODO: use your own config.
    DEDUPE_NEWS_TASK_QUEUE_NAME = "top-news-DEDUPE_NEWS_TASK_QUEUE_NAME"
    SCRAPE_NEWS_TASK_QUEUE_URL = ""  # TODO: use your own config.
    SCRAPE_NEWS_TASK_QUEUE_NAME = "top-news-SCRAPE_NEWS_TASK_QUEUE"
    
    SLEEP_TIME_IN_SECONDS = 5
    
    dedupe_news_queue_client = CloudAMQPClient(DEDUPE_NEWS_TASK_QUEUE_URL, DEDUPE_NEWS_TASK_QUEUE_NAME)
    scrape_news_queue_client = CloudAMQPClient(SCRAPE_NEWS_TASK_QUEUE_URL, SCRAPE_NEWS_TASK_QUEUE_NAME)
    

    TEST Monitor and Fetcher

    • Clear Redis
    redis-cli flushall
    
    • Queue_helper (if needed)
    python3 queue_helper.py
    
    • Get News URL From News Api (Faster)
    python3 news_monitor.py
    
    • Get News URL and Scrape on website
    python3 news_fetcher.py
    

    News Deduper - TFIDF

    sklearn - Python Package for ML

    • With the dependencies:
    • numpy
    • scipy
    • python-dateutil: to parse the news published date for comparison
    pip3 install sklearn
    pip3 install numpy
    pip3 install scipy
    pip3 install python-dateutil
    

    TFIDF Vectorizer - Test

    from sklearn.feature_extraction.text import TfidfVectorizer
    
    doc1 = "I like apples. I like oranges too"
    doc2 = "I love apples. I hate doctors"
    doc3 = "An apple a day keeps the doctor away"
    doc4 = "Never compare an apple to an orange"
    
    documents = [doc1, doc2, doc3, doc4]
    
    tfidf = TfidfVectorizer().fit_transform(documents)
    pairwise_sim = tfidf * tfidf.T
    
    print(pairwise_sim.A)
    
    • Outcomes :
    [[ 1.          0.12693309  0.          0.        ]
     [ 0.12693309  1.          0.          0.        ]
     [ 0.          0.          1.          0.27993128]
     [ 0.          0.          0.27993128  1.        ]]
    

    If the value is > 0.8, we can treat the two news items as the same and just ignore the new one.
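    As a small helper over the similarity matrix above (illustrative, not from the original), the rule looks like this:

    def is_duplicate(pairwise_sim, threshold=0.8):
        # Compare the new document (row 0) against every existing document.
        rows, _ = pairwise_sim.shape
        return any(pairwise_sim[0, col] > threshold for col in range(1, rows))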

    Deduper

    • Get the news from the queue; if its TF-IDF similarity to an existing article is greater than 0.8, don't add it to the DB
    • Go through CloudAMQP to get the message:
    while True:
        if cloudAMQP_client is not None:
            msg = cloudAMQP_client.getMessage()
            if msg is not None:
                # Parse and process the task
                try:
                    handle_message(msg)
                except Exception as e:
                    print(e)
            
            cloudAMQP_client.sleep(SLEEP_TIME_IN_SECONDS)
    
    • Handle the message and get all recent news based on publishedAt attribute
    def handle_message(msg):
        if msg is None or not isinstance(msg, dict) :
            return
        task = msg
        text = task['text']
        if text is None:
            return
    
    • Parse the string with dateutil's parser so we can compare datetimes
    • Give a begin time and an end time (+1 day)
    published_at = parser.parse(task['publishedAt'])
        published_at_day_begin = datetime.datetime(published_at.year, published_at.month, published_at.day, 0, 0, 0, 0)
        published_at_day_end = published_at_day_begin + datetime.timedelta(days=1)
    
    • Connect with MongoDB
    • Get the same day news by their begin and end time (greater and less than)
    # import common package in parent directory
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'common'))
    
      db = mongodb_client.get_db()
        same_day_news_list = list(db[NEWS_TABLE_NAME].find({'publishedAt': {'$gte': published_at_day_begin, '$lt': published_at_day_end}}))
    
    
    • Use TF-IDF to decide whether the news items belong together
    • Only the text is used
    if same_day_news_list is not None and len(same_day_news_list) > 0:
            documents = [news['text'] for news in same_day_news_list]
            documents.insert(0, text)
    
    • TF-IDF: we only need to look at column 0 of each row; if any value is > SAME_NEWS_SIMILARITY_THRESHOLD = 0.9, ignore the news
           # Calculate similarity matrix
            tfidf = TfidfVectorizer().fit_transform(documents)
            pairwise_sim = tfidf * tfidf.T
    
            print(pairwise_sim)
    
            rows, _ = pairwise_sim.shape
    
            for row in range(1, rows):
                if pairwise_sim[row, 0] > SAME_NEWS_SIMILARITY_THRESHOLD:
                    # Duplicated news. Ignore.
                    print("Duplicated news. Ignore.")
                    return
    
    • Save the other news
    • Change the publishedAt value to a MongoDB-friendly datetime
    • Use the digest as the key id (upsert)
        task['publishedAt'] = parser.parse(task['publishedAt'])
    
        db[NEWS_TABLE_NAME].replace_one({'digest': task['digest']}, task, upsert=True)
    

    Newspaper 3k

    Newspaper3k

    • Use this library instead of our own scraper
    • Getting news from different sources would otherwise require analyzing each page and working out its XPath structure
     pip3 install newspaper3k
    

    Change the News Fetcher

    from newspaper import Article
    
        article = Article(task['url'])
        article.download()
        article.parse()
        task['text'] = article.text
    

    Expand the News Sources We Can Save in News Monitor

    NEWS_SOURCES = [
        'bbc-news',
        'bbc-sport',
        'bloomberg',
        'cnn',
        'entertainment-weekly',
        'espn',
        'ign',
        'techcrunch',
        'the-new-york-times',
        'the-wall-street-journal',
        'the-washington-post'
    ]
    

    Authentication

    • Separate Logic and UI

    LoginForm

    const LoginForm = ({
      onSubmit,
      onChange,
      errors,
      user,
    }) => (
        html codes
    );
    
    • onSubmit: function
    • onChange: function
    • errors: value, controls the display of error messages
    • user: value
    Login title
    Error message
    Text Field (Account)
    Error message
    Text Field (Password)
    Error Message
    Submit button
    Sign up
    
    • Error message: if there is an errors.summary, show it:
     {errors.summary && <div className="row"><p className="error-message">{errors.summary}</p></div>}
    
    • Submit button
    <input type="submit" className="waves-effect waves-light btn indigo lighten-1" value='Log in'/>
    
    • Check parameters: when using LoginForm, four props are required (onSubmit, onChange, errors, user); otherwise an error is reported
    
    LoginForm.propTypes = {
       onSubmit: PropTypes.func.isRequired,
       onChange: PropTypes.func.isRequired,
       errors: PropTypes.object.isRequired,
       user: PropTypes.object.isRequired  
    }
    

    Login Page

    • render -> onSubmit / onChange / errors / user
        render() {
          return (
            <LoginForm
              onSubmit = {(e) => this.processForm(e)}
              onChange = {(e) => this.changeUser(e)}
              errors = {this.state.errors}
              user = {this.state.user}
            />
    
    • Constructor : errors, user(email, password)
      constructor(props) {
        super(props);
        this.state = {
          errors: {},
          user: {
            email: '',
            password: ''
          }
        };
      }
    
    • processForm : get the states of user email and password
      processForm(event) {
        event.preventDefault();
        const email = this.state.user.email;
        const password = this.state.user.password;
    
        console.log('email: ' + email);
        console.log('password: ' + password);
    
        // TODO: post login data
      }
    
    • changeUser: when the user types an email or password, write the new value into state:
      changeUser(event) {
        const field = event.target.name;
        const user = this.state.user;
        user[field] = event.target.value;
    
        this.setState({user});
    
      }
    

    Index -> Import Login Page

    import React from 'react';
    import ReactDOM from 'react-dom';
    import App from './App/App';
    import LoginPage from './Login/LoginPage';
    
    ReactDOM.render(
      <LoginPage />,
      document.getElementById('root')
    );
    

    Script Tag in index.js

    • Then we face a problem between jQuery and React (since our Materialize CSS needs jQuery for its styling)

    • In public/index.html, add the jQuery CDN in the head:

    <script src="https://code.jquery.com/jquery-2.1.1.min.js"></script>
    
    • In App.js, import the Materialize JS:
    import 'materialize-css/dist/css/materialize.min.css';
    import 'materialize-css/dist/js/materialize.min.js';
    

    SignUpForm

    • Add a password confirm field

     <div className="row">
              <div className="input-field col s12">
                <input id="confirm_password" type="password" name="confirm_password" className="validate" onChange={onChange}/>
            <label htmlFor="confirm_password">Confirm Password</label>
              </div>
            </div>
    

    SignUpPage

    • Add a password confirm field:
        this.state = {
          errors: {},
          user: {
            email: '',
            password: '',
            confirm_password: ''
          }
        };
    
    if (password !== confirm_password) {
    return;
    }
    
    

    Authentication Implementation

    (Image: ./image/Auth.png)

    Frontend:

    1. The frontend can only check whether localStorage has a token; verifying the user's identity actually stays on the server side
    2. SignUp page: POSTs the request directly to the backend
    3. Login page: POSTs the request directly to the backend
    4. Base: React Router redirects to Login if the user is not logged in

    Backend:

    1. Handles the SignUp request: validates the input, salts + hashes the password, then stores the user's email & password in the DB (validator + bcrypt + passport)
    2. Handles the Login request: compares the password (passport)
      (passport + mongoose handle all connections to the database and the password comparison)
    3. When loadMoreNews() is called, auth_checker verifies that the token stored on the frontend is correct (jwt)

    Login

    router| app.post('auth/login')
    
    Normal| validateSignupForm()
    
    Normal| passport.authenticate()
    
    Passport|'local-login' strategy
    
    Token| sign and return token
    
    

    SignUp

    router|app.post('auth/signup')
    
    Normal|validateLoginForm()
    
    Normal|passport.authenticate()
    
    Passport|'local-signup' strategy
    

    Web

    XXXX| loadMoreNews (with token)
    
    Token| authChecker
    
    Token| verify token
    
    router| app.get('news')
    
    Normal|return news
    

    FrontEnd Auth

    JWT: the browser keeps the token

    • After a successful login, the server returns a token, stored in the cache
    • client/src/Auth
    • Auth.js defines a class that controls reading and writing the token in localStorage, with HashMap-style operations
    • Show the user's email and use the email as the account
    • isUserAuthenticated only checks whether a token field exists; whether the token is valid is handled by the backend
      static authenticateUser(token, email) {
        localStorage.setItem('token', token);
        localStorage.setItem('email', email);
      }
    
      static isUserAuthenticated() {
        return localStorage.getItem('token') !== null;
      }
    
      static deauthenticateUser() {
        localStorage.removeItem('token');
        localStorage.removeItem('email');
      }
    
      static getToken() {
        return localStorage.getItem('token');
      }
    
      static getEmail() {
        return localStorage.getItem('email');
      }
    
    

    JWT and Salt

    SHA1 with salt

    You can only go from (password + salt) to the hash, never backwards.

    Signup: f(password, salt) = hash(password + salt)

    • For each user, we generate a random salt and add it to the user's password; the hash is generated server-side (a small Python illustration follows).
      Login: hash([provided password] + [stored salt]) == [stored hash]
    • Then the user is authenticated.
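    The app implements this in Node with bcrypt (shown later), but the idea fits in a few lines of Python, for illustration only:

    import hashlib
    import os

    def signup(password):
        salt = os.urandom(16)  # random per-user salt
        digest = hashlib.sha1(password.encode('utf-8') + salt).hexdigest()
        return salt, digest  # store both; the plain password is never stored

    def login(provided_password, stored_salt, stored_hash):
        attempt = hashlib.sha1(provided_password.encode('utf-8') + stored_salt).hexdigest()
        return attempt == stored_hash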

    Base Component with Login and SignUp

    • User not logged in: show the LoginPage
    (<div>
        <li><Link to="/login">Log in</Link></li>
        <li><Link to="/signup">Sign up</Link></li>
        </div>)
    
    • User logged in: show the App UI
    (<div>
        <li>{Auth.getEmail()}</li>
        <li><Link to="/logout">Log out</Link></li>
        </div>)
    
    • navbar:
    • Check Auth:
    {Auth.isUserAuthenticated() ? A : B}
    
    • Hand routing authority over to React Router:
     {children}
    
    • Hyperlinks for React Router use <Link to="">
    <li><Link to="/login">Log in</Link></li>
    

    React Router in Client

    npm install -s react-router@"<4.0.0"
    

    routes.js

    • When the user is at the root path: show the news; check whether they are logged in using isUserAuthenticated(); the callback returns App or LoginPage
    {
          path: '/',
          getComponent: (location, callback) => {
            if (Auth.isUserAuthenticated()) {
              callback(null, App);
            } else {
              callback(null, LoginPage);
            }
          }
        },
    
    • Login / SignUp
       {
          path: "/login",
          component: LoginPage
        },
    
        {
          path: "/signup",
          component: SignUpPage
        },
    
    
    • Logout : deauthenticateUser()
        {
          path: '/logout',
          onEnter: (nextState, replace) => {
            Auth.deauthenticateUser();
    
            // change the current URL to /
            replace('/')
          }
    

    Index.js

    import ReactDom from 'react-dom';
    import React from 'react';
    
    import { browserHistory, Router } from 'react-router';
    import routes from './routes';
    
    ReactDom.render(
      <Router history={browserHistory} routes={routes} />,
      document.getElementById('root')
    );
    

    Server Side Auth

    App.js

    • The cors package: just import it and app.use() it; cross-origin requests should only be allowed in the development environment
    npm install -s cors
    

    LoginPage/SignUpPage - actually send the data

    • The server will expose a new API that handles signup and login
    • LoginPage / SignUpPage
    const url = 'http://' + window.location.hostname + ':3000' + '/auth/signup';
    
     const url = 'http://' + window.location.hostname + ':3000' + '/auth/login';
    
    
    • The POST request body must be stringified JSON
     const url = 'http://' + window.location.hostname + ':3000' + '/auth/login';
        const request = new Request(
          url,
          {
            method: 'Post',
            headers: {
              'Accept': 'application/json',
              'Content-Type': 'application/json',
            },
            body: JSON.stringify({
              email: this.state.user.email,
              password: this.state.user.password
            })
          }
        );
    
    • Fetch: first confirm that the response status is 200
    fetch(request).then(response => {
          if (response.status === 200) {
            this.setState({
              errors: {}
            });
    
            response.json().then(json => {
              console.log(json);
              Auth.authenticateUser(json.token, email);
              this.context.router.replace('/');
            });
    
    • React Router
    this.context.router.replace('/');
    
    • On error:
    else {
    console.log('Login failed.');
    response.json().then(json => {
        const errors = json.errors ? json.errors : {};
        errors.summary = json.message;
        this.setState({errors});
    });
    }
    });
    

    How to use props / Context / Update

    ReactBaseClasses

    Design error from React

    • To make react-router work.
    LoginPage.contextTypes = {
      router: PropTypes.object.isRequired
    };
    

    Server for getting DB (validateLogin/SignUp/Passport)

    {
      "mongoDbUri": "mongodb://test:test@ds058579.mlab.com:58579/cs503",
      "jwtSecret": "a secret phrase!!"
    }
    
    npm install -s mongoose
    

    mongoose handles secure storage of user credentials

    Use a Schema for the mapping; the user document has two fields, email and password.

    • Add comparePassword() to UserSchema to verify that the password the user submits equals the stored one, compared with bcrypt.compare.
    UserSchema.methods.comparePassword = function comparePassword(password, callback) {
      bcrypt.compare(password, this.password, callback);
    };
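
    The schema definition itself is not quoted above; a minimal sketch of models/user.js consistent with this section (the unique index on email is an assumption):

    const mongoose = require('mongoose');
    const bcrypt = require('bcrypt');

    const UserSchema = new mongoose.Schema({
      email: {
        type: String,
        index: { unique: true }
      },
      password: String
    });

    // comparePassword() and the pre('save') hook shown in this section
    // attach to UserSchema here, before the model is registered.

    module.exports = mongoose.model('User', UserSchema);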
    

    bcrypt - Salt and Hash

    • The flow to handle: if the password has not been modified, there is nothing to do, so continue. Otherwise generate a salt (genSalt) and return immediately on a saltError; if that succeeds, hash the user's password with the salt and return on a hashError; finally assign the hash to user.password (the key step).
    UserSchema.pre('save', function saveHook(next) {
      const user = this;
    
      // proceed further only if the password is modified or the user is new.
      if (!user.isModified('password')) return next();
    
      return bcrypt.genSalt((saltError, salt) => {
        if (saltError) { return next(saltError); }
    
        return bcrypt.hash(user.password, salt, (hashError, hash) => {
          if (hashError) { return next(hashError); }
    
          // replace a password string with hashed value.
          user.password = hash;
    
          return next();
        });
      });
    });
    

    Main.js

    • Initialize the UserSchema and connect it to MongoDB.

    app.js -> wire the routes to the config

    var config = require('./config/config.json');
    require('./models/main.js').connect(config.mongoDbUri);
    

    Login Passport

    Passport & Passport-Local & JWT

    npm install -s passport
    npm install -s passport-local
    npm install -s jsonwebtoken
    
    • Export a PassportLocalStrategy; the important options are usernameField / passwordField — the rest follows the documentation.

    • Look the user up by email via User.findOne on the UserSchema defined in mongoose; if MongoDB cannot be reached, an error is raised.

    • If !user, no such user exists: return an error message, and remember never to return any user information.

    const error = new Error('Incorrect email or password');

    • Compare passwords with user.comparePassword, the method written earlier in user.js
       // check if a hashed user's password is equal to a value saved in the database
        return user.comparePassword(userData.password, (passwordErr, isMatch) => {
          if (passwordErr) { return done(passwordErr); }
    
          if (!isMatch) {
            const error = new Error('Incorrect email or password');
            error.name = 'IncorrectCredentialsError';
    
            return done(error);
          }
    
    • MongoDB auto-generates an _id; the token is issued for that user id. We also return name: user.email, although the frontend does not actually depend on the backend returning it.
          const payload = {
            sub: user._id
          };
    
          // create a token string
          const token = jwt.sign(payload, config.jwtSecret);
          const data = {
            name: user.email
          };
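
    Stitched together, a minimal sketch of the whole 'local-login' strategy (./passport/login_passport.js) under the assumptions above — the error names and the done(null, token, data) contract mirror the fragments in this section:

    const jwt = require('jsonwebtoken');
    const PassportLocalStrategy = require('passport-local').Strategy;
    const User = require('mongoose').model('User');
    const config = require('../config/config.json');

    module.exports = new PassportLocalStrategy({
      usernameField: 'email',
      passwordField: 'password',
      session: false
    }, (email, password, done) => {
      const userData = { email: email.trim(), password: password.trim() };

      return User.findOne({ email: userData.email }, (err, user) => {
        if (err) { return done(err); }

        if (!user) {
          const error = new Error('Incorrect email or password');
          error.name = 'IncorrectCredentialsError';
          return done(error);
        }

        // check if a hashed user's password is equal to a value saved in the database
        return user.comparePassword(userData.password, (passwordErr, isMatch) => {
          if (passwordErr) { return done(passwordErr); }

          if (!isMatch) {
            const error = new Error('Incorrect email or password');
            error.name = 'IncorrectCredentialsError';
            return done(error);
          }

          const payload = { sub: user._id };
          const token = jwt.sign(payload, config.jwtSecret);
          const data = { name: user.email };

          // passport passes these through to the route's authenticate callback
          return done(null, token, data);
        });
      });
    });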
    

    SignUp Passport

    • Use a PassportLocalStrategy here as well: check MongoDB for an existing user by simply trying to insert the new email; if the insert fails, the user already exists and an error is returned, otherwise the signup succeeds.

    Bind the two LocalStrategies in app.js

    • They are only exercised by the Auth API later; for now we just register them
    var passport = require('passport');
    
    app.use(passport.initialize());
    var localSignUpStrategy = require('./passport/signup_passport');
    var localLoginStrategy = require('./passport/login_passport');
    passport.use('local-signup', localSignUpStrategy);
    passport.use('local-login', localLoginStrategy);
    
    

    Middleware

    • auth_checker
    • Express middleware: it receives req, works on res, and hands control to the next handler in the chain.
    • jwt.verify unpacks the token; take userId = decoded.sub, then findById in MongoDB with that id; if the user exists, proceed to next().
    • Why middleware? Only a verified user may read the news, so the check must run before the user reaches the news routes.
    const authChecker = require('./middleware/auth_checker');
    
    app.use('/', index);
    app.use('/auth', auth);
    app.use('/news', authChecker);
    app.use('/news', news);
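
    A sketch of middleware/auth_checker.js matching that description — the 'bearer <token>' header format and the config path are assumptions:

    const jwt = require('jsonwebtoken');
    const User = require('mongoose').model('User');
    const config = require('../config/config.json');

    module.exports = (req, res, next) => {
      if (!req.headers.authorization) {
        return res.status(401).end();
      }

      // header format: "Authorization: bearer <token>"
      const token = req.headers.authorization.split(' ')[1];

      return jwt.verify(token, config.jwtSecret, (err, decoded) => {
        if (err) { return res.status(401).end(); }

        const userId = decoded.sub;

        // only let the request through if this user really exists
        return User.findById(userId, (userErr, user) => {
          if (userErr || !user) { return res.status(401).end(); }
          return next();
        });
      });
    };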
    

    Body Parser

    • Parse the string the user POSTs into JSON
    npm install --save body-parser
    
    var bodyParser = require('body-parser');
    app.use(bodyParser.json());
    

    Auth API

    Validator

    • Validates user input; input can be malicious, so validate it before use. It supports many locales besides English.
    npm install -s validator
    
    • validateSignupForm checks that the user's input is made of strings; internal errors are converted into outward-facing error messages
    function validateSignupForm(payload) {
      console.log(payload);
      const errors = {};
      let isFormValid = true;
      let message = '';
    
      if (!payload || typeof payload.email !== 'string' || !validator.isEmail(payload.email)) {
        isFormValid = false;
        errors.email = 'Please provide a correct email address.';
      }
    
      if (!payload || typeof payload.password !== 'string' || payload.password.length < 8) {
        isFormValid = false;
        errors.password = 'Password must have at least 8 characters.';
      }
    
      if (!isFormValid) {
        message = 'Check the form for errors.';
      }
    
      return {
        success: isFormValid,
        message,
        errors
      };
    }
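
    validateLoginForm, referenced by the login flow, is analogous; a minimal sketch (checking only for non-empty strings is an assumption):

    function validateLoginForm(payload) {
      const errors = {};
      let isFormValid = true;
      let message = '';

      if (!payload || typeof payload.email !== 'string' || payload.email.trim().length === 0) {
        isFormValid = false;
        errors.email = 'Please provide your email address.';
      }

      if (!payload || typeof payload.password !== 'string' || payload.password.trim().length === 0) {
        isFormValid = false;
        errors.password = 'Please provide your password.';
      }

      if (!isFormValid) {
        message = 'Check the form for errors.';
      }

      return {
        success: isFormValid,
        message,
        errors
      };
    }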
    

    app.js

    var auth = require('./routes/auth');
    app.use('/auth', auth);
    

    WebServer Features

    • Pagination
    • Preference Model
    • Click Log Processor

    Pagination

    • process of dividing a document into discrete pages
    • User doesn’t need all data
    • User cannot handle too much data
    • System cannot handle too much data
    • User cannot wait too long (or the browser freezes)

    Client - Side Pagination

    • No change on the backend: the backend sends all data to the client
    • Pro: easy, since no backend assistance is needed
    • Cons: the client becomes slow due to I/O and memory consumption
    • All the news is sent to the client on the first request; pagination just slices the data already on the client (see the sketch below)
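
    A toy sketch of that client-side slicing (PAGE_SIZE and the function name are illustrative):

    const PAGE_SIZE = 10;

    // All news already lives in memory on the client; a "page" is just a slice.
    function getPage(allNews, pageNum) {
      const begin = (pageNum - 1) * PAGE_SIZE;
      return allNews.slice(begin, begin + PAGE_SIZE);
    }

    // getPage(news, 1) -> items 0..9, getPage(news, 2) -> items 10..19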

    Server - Side Pagination

    • The server sends paginated data to the client
    • Pros: much better user experience
    • Cons: extra work and storage on the backend; needs coordination between client and server
    • The client requests one page at a time and the server returns only that slice

    Backend Server (the web server doesn't deal with business logic)

    • Add a new function
    • The RPC API uses snake_case; the internal operations functions use camelCase
    def get_news_summaries_for_user(user_id, page_num):
        print("get_news_summaries_for_user is called with %s and %s" %(user_id, page_num))
        return operations.getNewsSummaries(user_id, page_num)
    

    Operations.py

    • pickle serializes the JSON-like dict into a byte string that Redis can store
    
    REDIS_HOST = "localhost"
    REDIS_PORT = 6379
    
    redis_client = redis.StrictRedis(REDIS_HOST, REDIS_PORT, db=0)
    
    • We need Redis to store each user's list of news digests (sliced from a begin index)
    • NEWS_LIST_BATCH_SIZE = 10 is the number of news items sent per client request
    • If the news is not in Redis, query MongoDB instead, sorted by 'publishedAt' and capped by a limit
    • pickle.dumps the digests (each digest is the key used to look the full article up in MongoDB)
        page_num = int(page_num)
        begin_index = (page_num - 1) * NEWS_LIST_BATCH_SIZE
        end_index = page_num * NEWS_LIST_BATCH_SIZE
    
        sliced_news = []
    

    If the news digests are already in Redis

     if redis_client.get(user_id) is not None:
            total_new_digests = pickle.loads(redis_client.get(user_id))
            sliced_news_digests = total_new_digests[begin_index:end_index]
            db = mongodb_client.get_db()
            sliced_news = list(db[NEWS_TABLE_NAME].find({'digest' : {'$in': sliced_news_digests}}))
    

    If the news digests are not in Redis

        else:
            db = mongodb_client.get_db()
            total_news = list(db[NEWS_TABLE_NAME].find().sort([('publishedAt', -1)]).limit(NEWS_LIMIT))
            total_news_digest = [x['digest'] for x in total_news]
            redis_client.set(user_id, pickle.dumps(total_news_digest))
            redis_client.expire(user_id, USER_NEWS_TIME_OUT_IN_SECONDS)
    
            sliced_news = total_news[begin_index:end_index]
    

    Operations Test

    • Test that news summaries can be requested at all
    def test_getNewsSummariesForUser_basic():
        news = operations.getNewsSummariesForUser('test', 1)
        assert len(news) > 0
        print('test_getNewsSummariesForUser_basic passed')
    
    • Test pagination: pages 1 and 2 must both return news, and the same digest must not appear on both pages (otherwise the test fails)
    def test_getNewsSummariesForUser_pagination():
        news_page_1 = operations.getNewsSummariesForUser('test', 1)
        news_page_2 = operations.getNewsSummariesForUser('test', 2)
    
        assert len(news_page_1) > 0
        assert len(news_page_2) > 0 
    
        digests_page_1_set = set(news['digest'] for news in news_page_1)
        digests_page_2_set = set(news['digest'] for news in news_page_2)
    
        assert len(digests_page_1_set.intersection(digests_page_2_set)) == 0
    
        print('test_getNewsSummariesForUser_pagination passed')
    

    Connect with FrontEnd - RPC Client(in web server)

    • Call the backend server by API (getNewsSummariesForUser)
    function getNewsSummariesForUser(user_id, page_num, callback) {
      client.request('getNewsSummariesForUser', [user_id, page_num], function(err, response){
        if(err) throw err;
        console.log(response.result);
        callback(response.result);
      });
    }
    
    module.exports = {
      add : add,
      getNewsSummariesForUser: getNewsSummariesForUser
    }
    
    
    • Test getNewsSummariesForUser
    // invoke "getNewsSummariesForUser"
    client.getNewsSummariesForUser('test_user', 1, function(response) {
      console.assert(response != null);
    });
    

    Refactor the Get News API

    • Get data by calling the API with userId and pageNum in the path
    // "localhost:3000/news/userId/1@1.com/pageNum/2"
    router.get('/userId/:userId/pageNum/:pageNum', function(req, res, next) {
      console.log('Fetching news...');
      user_id = req.params['userId'];
      page_num = req.params['pageNum'];
    
      rpc_client.getNewsSummariesForUser(user_id, page_num, function(response) {
        res.json(response);
      });
    });
    

    Change Client - NewsPanel.js

    • Keep pageNum in state, starting from 1
    • loadedAll status -> if true, we won't send further requests to the server
    this.state = { news:null, pageNum:1, loadedAll:false};
    
    
    • loadMoreNews()
    • Check the state first
    • Auth.getEmail() to get userID
    • Auth.getToken()
    • this.state.pageNum to get pageNum
    • Update the state when a new page loads: append the news and advance pageNum
     loadMoreNews() {
        if (this.state.loadedAll == true) {
          return;
        }
    
        const news_url = 'http://' + window.location.hostname + ':3000' +
            '/news/userId/' + Auth.getEmail() + '/pageNum/' + this.state.pageNum;
    
        const request = new Request(
          encodeURI(news_url),
          {
            method:'GET',
            headers: {
              'Authorization': 'bearer ' + Auth.getToken(),
            }
          });
    
        fetch(request)
          .then(res => res.json())
          .then(news => {
            if (!news || news.length == 0) {
              this.setState({loadedAll:true});
            }
    
            this.setState({
              news: this.state.news ? this.state.news.concat(news) : news,
              pageNum: this.state.pageNum + 1,
            });
          });
      }
    

    Preference Model

    • We need a model to represent user’s news preference.
    • Possible Dimensions:
    Topic - politics, sports...
    Source - CNN, BBC...
    Time - newest...
    

    Time Decay Model (Moving Average)

    • Topic Based
    • Based on user’s click
    • More weight on more recent activities
    • Each topic is associated with a predicted click probability
    • All topics start with same probability.

    How to Update the table?

    • If selected: p = (1 - a) * p + a
    • If not selected: p = (1 - a) * p
    • a is the time-decay weight: the larger it is, the more weight recent activity gets (we pick a = 0.2)

    Topic          Probability
    Sport          0.33
    Entertainment  0.33
    World          0.33

    If the user clicks a news item whose topic is "Sport"

    • For Sport, we apply (1 - 0.2) * 0.33 + 0.2
    • For the others, we apply (1 - 0.2) * 0.33

    Topic          Probability
    Sport          0.464
    Entertainment  0.264
    World          0.264

    • We can then decide how to order the news shown to the user by these probabilities (see the sketch below)
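
    A sketch of the update rule (in JavaScript for illustration; the real implementation lives in the Python click log processor later):

    const ALPHA = 0.2; // time-decay weight

    function updatePreference(preference, clickedTopic) {
      for (const topic of Object.keys(preference)) {
        preference[topic] = (topic === clickedTopic)
          ? (1 - ALPHA) * preference[topic] + ALPHA // clicked: pull toward 1
          : (1 - ALPHA) * preference[topic];        // not clicked: decay toward 0
      }
      return preference;
    }

    // updatePreference({ Sport: 0.33, Entertainment: 0.33, World: 0.33 }, 'Sport')
    // -> { Sport: 0.464, Entertainment: 0.264, World: 0.264 }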

    Click Log Processor - Modify the model by User Clicks

    • Log now and analyze later / treat clicks as real-time events

    Why do we need this?

    • Understand user’s behavior
    • Improve UI experience
    • A/B Test
    • Pre-load / Pre-fetch
    • Future product

    Why use a separate processor?

    • Click log processing is not necessarily a synchronized step. (Publish into RabbitMQ: it need not be a blocking operation, since it only records behavior and returns nothing; non-urgent work goes into the queue first.)
    • Don't let the backend service load the MongoDB model directly to compute recommendations; we build a separate Recommendation Service. (The backend server does not need to know how the model is stored in the DB or which model produces the result — only the resulting topic list / preference.)
    • If the backend server talked to the DB directly, any change to the model in the DB would ripple into the backend server and add extra maintenance work, so the Recommendation Service relays the DB's data and results instead.
    • Recommendation Service Client & Recommendation Service
    Client
      |  logNewsClickForUser(news_id, user_id)
    Web Server
      |  logNewsClickForUser(news_id, user_id)
    Backend Server ---- getPreferenceForUser ----> Recommendation Service Client
      |                                                      |
    Click Logger Queue                                       |
      (click log message: {news_id, user_id})                |
      |                                                      |
    Click Log Processor ----> DB <---- Recommendation Service
      (handles click log events:
       updates the preference model)

    Log Processor

    Client - Web Server / Client / NewsCard

    • POST an event to the backend
    • Authorization: a token is required; not everyone may send events to the backend
    • Fetch: send out the request
      redirectToUrl(url, event) {
        event.preventDefault();
        this.sendClickLog();
        // the original snippet is truncated here; presumably the click then opens the article
        window.open(url, '_blank');
      }
    
     sendClickLog() {
        const url = 'http://' + window.location.hostname + ':3000' +
            '/news/userId/' + Auth.getEmail() + '/newsId/' + this.props.news.digest;
    
        const request = new Request(
          encodeURI(url),
          {
            method: 'POST',
            headers: { 'Authorization': 'bearer ' + Auth.getToken()},
          });
    
        fetch(request);
      }
    

    Web Server / Server / RPC Client

    • Add a new function
    • No callback is needed
    • Export the function
    function logNewsClickForUser(user_id, news_id) {
      client.request('logNewsClickForUser', [user_id, news_id], function(err, response){
        if(err) throw err;
        console.log(response.result);
      });
    }
    
    module.exports = {
      add : add,
      getNewsSummariesForUser: getNewsSummariesForUser,
      logNewsClickForUser : logNewsClickForUser
    }
    

    Test Rpc Client

    • This test only works after the Python RPC backend server is implemented and running
    client.logNewsClickForUser('test_user', 'test_news');
    

    Add a Router connect rpc_client and rpc server

    • news.js
    • post request
    router.post('/userId/:userId/newsId/:newsId', function(req, res, next) {
      console.log('Logging news click...');
      user_id = req.params['userId'];
      newsId = req.params['newsId'];
    
      rpc_client.logNewsClickForUser(user_id, newsId);
      // status(200) alone never sends the response; end it explicitly.
      res.status(200).send();
    });
    

    Backend Server

    • service.py
    def log_news_click_for_user(user_id, news_id):
        print("log_news_click_for_user is called with %s and %s" %(user_id, news_id))
        return operations.logNewsClickForUser(user_id, news_id)
    
    • operations.py
    • Serialization: the message dict is stringified before being queued
    • Use a separate queue to carry the click-log data
    LOG_CLICKS_TASK_QUEUE_URL = #TODO: use your own config.
    LOG_CLICKS_TASK_QUEUE_NAME = "tap-news-log-clicks-task-queue"
    
    def logNewsClickForUser(user_id, news_id):
        # Send log task to machine learning service for prediction
        message = {'userId': user_id, 'newsId': news_id, 'timestamp': str(datetime.utcnow())}
        cloudAMQP_client.sendMessage(message);
    

    Recommendation Service

    Click Logger Processor

    mkdir recommendation_service
    code click_log_processor.py
    
    • Pull one message off the queue per iteration
    def run():
        while True:
            if cloudAMQP_client is not None:
                msg = cloudAMQP_client.getMessage()
                if msg is not None:
                    # Parse and process the task
                    try:
                        handle_message(msg)
                    except Exception as e:
                        print(e)
                        pass
                # Remove this if this becomes a bottleneck.
                cloudAMQP_client.sleep(SLEEP_TIME_IN_SECONDS)
    

    Handle message

    • Update the user's preference in the user preference model table
    PREFERENCE_MODEL_TABLE_NAME = "user_preference_model"
    
    db = mongodb_client.get_db()
        model = db[PREFERENCE_MODEL_TABLE_NAME].find_one({'userId': userId})
    
    • Init: if the user has no model yet, create one with the same initial probability for every class
     if model is None:
            print('Creating preference model for new user: %s' % userId)
            new_model = {'userId' : userId}
            preference = {}
            for i in news_classes.classes:
                preference[i] = float(INITIAL_P)
            new_model['preference'] = preference
            model = new_model
    
        print('Updating preference model for new user: %s' % userId)
    
    • If the user's preference table already exists, just update the model
    • The 'class' field on each news item is produced by the TensorFlow topic classifier
        news = db[NEWS_TABLE_NAME].find_one({'digest': newsId})
        if (news is None
            or 'class' not in news
            or news['class'] not in news_classes.classes):
            
            print('Skipping processing...')
            return
        
        click_class = news['class']
    

    Update the Table

       # Update the clicked one.
        old_p = model['preference'][click_class]
        model['preference'][click_class] = float((1 - ALPHA) * old_p + ALPHA)
    
        # Update not clicked classes.
        for i, prob in model['preference'].items():
            if not i == click_class:
                model['preference'][i] = float((1 - ALPHA) * model['preference'][i])
    
        print(model)
        db[PREFERENCE_MODEL_TABLE_NAME].replace_one({'userId': userId}, model, upsert=True)
    

    news_classes - the list of all topic classes

    NUM_OF_CLASSES = 8
    =================
    for i in news_classes.classes:
    
    • The length of the list must equal the number of classes
    classes = [
        "World",
        "US",
        "Business",
        "Technology",
        "Entertainment",
        "Sports",
        "Health",
        "Crime",
    ]
    

    Test

    • Create a message
    • Delete the test user's model first so the DB is clean
    import click_log_processor
    import os
    import sys
    
    from datetime import datetime
    
    # import common package in parent directory
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'common'))
    
    import mongodb_client
    
    PREFERENCE_MODEL_TABLE_NAME = "user_preference_model"
    NEWS_TABLE_NAME = "news"
    
    NUM_OF_CLASSES = 8
    
    # Start MongoDB before running following tests.
    def test_basic():
        db = mongodb_client.get_db()
        db[PREFERENCE_MODEL_TABLE_NAME].delete_many({"userId": "test_user"})
    
        msg = {"userId": "test_user",
               "newsId": "test_news",
               "timestamp": str(datetime.utcnow())}
    
        click_log_processor.handle_message(msg)
    
        model = db[PREFERENCE_MODEL_TABLE_NAME].find_one({'userId':'test_user'})
        assert model is not None
        assert len(model['preference']) == NUM_OF_CLASSES
    
        print('test_basic passed!')
    
    
    if __name__ == "__main__":
        test_basic()
    

    Recommendation Service

    • This part deserves further study
    import operator
    import os
    import sys
    
    from jsonrpclib.SimpleJSONRPCServer import SimpleJSONRPCServer
    
    # import common package in parent directory
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'common'))
    
    import mongodb_client
    
    PREFERENCE_MODEL_TABLE_NAME = "user_preference_model"
    
    SERVER_HOST = 'localhost'
    SERVER_PORT = 5050
    
    # Ref: https://www.python.org/dev/peps/pep-0485/#proposed-implementation
    # Ref: http://stackoverflow.com/questions/5595425/what-is-the-best-way-to-compare-floats-for-almost-equality-in-python
    def isclose(a, b, rel_tol=1e-09, abs_tol=0.0):
        return abs(a-b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)
    
    def getPreferenceForUser(user_id):
        """ Get user's preference in an ordered class list """
        db = mongodb_client.get_db()
        model = db[PREFERENCE_MODEL_TABLE_NAME].find_one({'userId':user_id})
        if model is None:
            return []
    
        sorted_tuples = sorted(list(model['preference'].items()), key=operator.itemgetter(1), reverse=True)
        sorted_list = [x[0] for x in sorted_tuples]
        sorted_value_list = [x[1] for x in sorted_tuples]
    
        # If the first preference is same as the last one, the preference makes
        # no sense.
        if isclose(float(sorted_value_list[0]), float(sorted_value_list[-1])):
            return []
    
        return sorted_list
    
    
    # Threading HTTP Server
    RPC_SERVER = SimpleJSONRPCServer((SERVER_HOST, SERVER_PORT))
    RPC_SERVER.register_function(getPreferenceForUser, 'getPreferenceForUser')
    
    print("Starting HTTP server on %s:%d" % (SERVER_HOST, SERVER_PORT))
    
    RPC_SERVER.serve_forever()
    

    Recommendation service client

    • Let the backend server (operations.py) use this method
    import jsonrpclib
    
    URL = "http://localhost:5050/"
    
    client = jsonrpclib.ServerProxy(URL)
    
    def getPreferenceForUser(userId):
        preference = client.getPreferenceForUser(userId)
        print("Preference list: %s" % str(preference))
        return preference
    

    Modify operations to use Recommendation Service by importing

    import news_recommendation_service_client
    
    • Get Preference for the user
        # Get preference for the user
        preference = news_recommendation_service_client.getPreferenceForUser(user_id)
        topPreference = None
    
        if preference is not None and len(preference) > 0:
            topPreference = preference[0]
    
        for news in sliced_news:
            # Remove text field to save bandwidth.
            del news['text']
            if news['class'] == topPreference:
                news['reason'] = 'Recommend'
            if news['publishedAt'].date() == datetime.today().date():
                news['time'] = 'today'
        return json.loads(dumps(sliced_news))
    

    Week 4

    • Jupyter : On Docker
    • Build a CNN Model
    • Trainer
    • Server
    • Integration with News Pipeline

    Jupyter

    Data Format

    • We will use about 500 news items with labeled topics to train our model. The format of the data is
    [#class_number],[news_title],[news_description],[news_source]
    
    • An Example:
    3,Trump: Netanyahu should 'hold back' on settlements,"President Donald Trump told Israeli Prime Minister Benjamin Netanyahu Wednesday that the US is going to push for a peace deal with the Palestinians and asked Israel ""to hold back"" on settlement construction.",cnn
    

    8 Classes:

    • 1: “World”,
    • 2: “US”,
    • 3: “Business”,
    • 4: “Technology”,
    • 5: “Entertainment”,
    • 6: “Sports”,
    • 7: “Health”,
    • 8: “Crime”

    See the sources: by pandas

    import pandas as pd
    
    DATA_SET_FILE = './labeled_news.csv'
    
    df = pd.read_csv(DATA_SET_FILE, header=None)
    
    print(df[3])
    

    Statistics

    Let’s first take a look at our data from a statistics perspective.

    We can use pandas to do a quick
    [data analysis](http://machinelearningmastery.com/quick-and-dirty-data-analysis-with-pandas/)

    • In order to see the distribution of the data.
    • We need to get to know our data better by quickly analyzing it with pandas.
    import pandas as pd
    
    DATA_SET_FILE = './labeled_news.csv'
    
    df = pd.read_csv(DATA_SET_FILE, header=None)
    
    
    print("class description")
    print(df[0].describe())
    
    print("""
    
    =====================
    
    """)
    
    print("source description")
    print(df[3].describe())
    
    class description
    count    545.000000
    
    source description
    count     545
    unique     13
    top       cnn
    freq      132
    Name: 3, dtype: object
    

    Source Distribution

    df[3].value_counts().plot(kind="bar")
    

    Build a CNN Model

    Vocabulary Embedding

    • Tensorflow provides an easy-to-use tool to do embedding.
    • One example:
        Sentence 1: "I like apple"
        Sentence 2: "I like banana"
        Sentence 3: "I eat apple"
    
    • We scan all 3 sentences then we found there are 5 unique words:
        ['I', 'like', 'eat', 'apple', 'banana']
    Now we can use number to represent each of them:
    
        [0, 1, 2, 3, 4]
    So, the three sentences can be encoded into:
    
        Sentence 1: [0, 1, 3]
        Sentence 2: [0, 1, 4]
        Sentence 3: [0, 2, 3]
    

    VocabularyProcessor

    • Tensorflow provides an easy-to-use tool to do embedding.
    import numpy as np
    import tensorflow as tf
    
    MAX_DOCUMENT_LENGTH = 5
    
    vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
    
    # fit and transform
    sentences = [
        "I like apple",
        "I like banana",
        "I eat apple"
    ]
    
    embedded_sentences = np.array(list(vocab_processor.fit_transform(sentences)))
    print(embedded_sentences)
    
    print('Total unique words: %d' % len(vocab_processor.vocabulary_))
    
    print("")
    # just fit: unseen words will be interpreted as 0
    new_sentences = [
        "This is a brand new sentence which we never seen before and is very long",
        "I hate doing homework!",
        "I like banana",
        "I eat apple"
    ]
    
    new_embedded_sentences = np.array(list(vocab_processor.transform(new_sentences)))
    print(new_embedded_sentences)
    
    • Result (unseen words map to 0):
    [[1 2 3 0 0]
     [1 2 4 0 0]
     [1 5 3 0 0]]
    Total unique words: 6
    
    [[0 0 0 0 0]
     [1 0 0 0 0]
     [1 2 4 0 0]
     [1 5 3 0 0]]
    

    One-hot Embedding

    We have converted the string into an integer vector. But that is not good enough. We need to convert a word into one-hot vector:

    Assume we only have 10 unique words.
    
        0 -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        1 -> [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
        2 -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
        3 -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
        4 -> [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
        5 -> [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
        6 -> [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
        7 -> [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
        8 -> [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
        9 -> [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    

    Thus, one sentence is converted into a matrix:

    Assume there are only 5 unique words:
    
        "I like apple" 
    -> 
        [1 2 3 0 0] 
    ->
        [
            [0, 1, 0, 0, 0],  => I
            [0, 0, 1, 0, 0],  => like
            [0, 0, 0, 1, 0],  => apple
            [1, 0, 0, 0, 0],  => padding
            [1, 0, 0, 0, 0]   => padding
        ]
    

    Tensorflow embed_sequence

    Tensorflow provides tf.contrib.layers.embed_sequence, which helps us with one-hot embedding.

    In the real world we don't use a literal [0, 1] one-hot vector per word; each word is instead mapped to a dense vector of real numbers.

    import tensorflow as tf
    
    vocab_size = 6
    embed_dim =  3
    
    sentences = [
        [1, 2, 3, 0, 0]
    ]
    embedded_sentences = tf.contrib.layers.embed_sequence(sentences, vocab_size, embed_dim)
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        result = sess.run(embedded_sentences)
        print(result)
    

    Convolutional Neural Networks (CNNs)

    We cannot dive deep into what convolution is here.

    But here is a great article about CNNs for NLP:
    http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

    Convolution

    (Figure: convolution schematic - http://deeplearning.stanford.edu/wiki/images/6/6c/Convolution_schematic.gif)

    Pooling

    (Figure: max pooling - http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/11/Screen-Shot-2015-11-05-at-2.18.38-PM.png)

    Put Everything Together

    # -*- coding: utf-8 -*-
    
    import numpy as np
    import os
    import pandas as pd
    import pickle
    import shutil
    import tensorflow as tf
    
    
    from sklearn import metrics
    
    learn = tf.contrib.learn
    
    DATA_SET_FILE = './labeled_news.csv'
    
    MAX_DOCUMENT_LENGTH = 500
    N_CLASSES = 8
    
    
    EMBEDDING_SIZE = 100
    N_FILTERS = 10
    WINDOW_SIZE = 10
    FILTER_SHAPE1 = [WINDOW_SIZE, EMBEDDING_SIZE]
    FILTER_SHAPE2 = [WINDOW_SIZE, N_FILTERS]
    POOLING_WINDOW = 4
    POOLING_STRIDE = 2
    
    LEARNING_RATE = 0.01
    STEPS = 200
    
    def generate_cnn_model(n_classes, n_words):
        """2 layer CNN to predict from sequence of words to a class."""
        def cnn_model(features, target):
            # Convert indexes of words into embeddings.
            # This creates embeddings matrix of [n_words, EMBEDDING_SIZE] and then
            # maps word indexes of the sequence into [batch_size, sequence_length,
            # EMBEDDING_SIZE].
    
            target = tf.one_hot(target, n_classes, 1, 0)
            word_vectors = tf.contrib.layers.embed_sequence(
                features, vocab_size=n_words, embed_dim=EMBEDDING_SIZE, scope='words')
    
            word_vectors = tf.expand_dims(word_vectors, 3)
            
            with tf.variable_scope('CNN_layer1'):
                # Apply Convolution filtering on input sequence.
                conv1 = tf.contrib.layers.convolution2d(
                    word_vectors, N_FILTERS, FILTER_SHAPE1, padding='VALID')
                # Add a RELU for non linearity.
                conv1 = tf.nn.relu(conv1)
                # Max pooling across output of Convolution+Relu.
                pool1 = tf.nn.max_pool(
                    conv1,
                    ksize=[1, POOLING_WINDOW, 1, 1],
                    strides=[1, POOLING_STRIDE, 1, 1],
                    padding='SAME')
                # Transpose matrix so that n_filters from convolution becomes width.
                pool1 = tf.transpose(pool1, [0, 1, 3, 2])
          
            with tf.variable_scope('CNN_layer2'):
                # Second level of convolution filtering.
                conv2 = tf.contrib.layers.convolution2d(
                    pool1, N_FILTERS, FILTER_SHAPE2, padding='VALID')
                # Max across each filter to get useful features for classification.
                pool2 = tf.squeeze(tf.reduce_max(conv2, 1), squeeze_dims=[1])
    
            # Apply regular WX + B and classification.
            logits = tf.contrib.layers.fully_connected(pool2, n_classes, activation_fn=None)
            loss = tf.contrib.losses.softmax_cross_entropy(logits, target)
    
            train_op = tf.contrib.layers.optimize_loss(
              loss,
              tf.contrib.framework.get_global_step(),
              optimizer='Adam',
              learning_rate=LEARNING_RATE)
    
            return ({
              'class': tf.argmax(logits, 1),
              'prob': tf.nn.softmax(logits)
            }, loss, train_op)
    
        return cnn_model
    
    
    def main(unused_argv):
        # Prepare training and testing data
        df = pd.read_csv(DATA_SET_FILE, header=None)
    
        # Random shuffle (sample returns a new frame, so reassign)
        df = df.sample(frac=1)
    
        train_df = df[0:450]
        test_df = df.drop(train_df.index)
    
        # x - news title, y - class
        x_train = train_df[1]
        y_train = train_df[0]
        x_test = test_df[1]
        y_test = test_df[0]
    
        # Process vocabulary
        vocab_processor = learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
        x_train = np.array(list(vocab_processor.fit_transform(x_train)))
        x_test = np.array(list(vocab_processor.transform(x_test)))
    
        n_words = len(vocab_processor.vocabulary_)
        print('Total words: %d' % n_words)
      
        # Build model
        classifier = learn.Estimator(
            model_fn=generate_cnn_model(N_CLASSES, n_words))
    
        # Train and predict
        classifier.fit(x_train, y_train, steps=STEPS)
    
        # Evaluate model
        y_predicted = [
            p['class'] for p in classifier.predict(x_test, as_iterable=True)
        ]
        print(y_predicted)
    
        score = metrics.accuracy_score(y_test, y_predicted)
        print('Accuracy: {0:f}'.format(score))
    
    if __name__ == '__main__':
        tf.app.run(main=main)
    

    News Topic Modeling Server
