Selector API
xray(url, selector)(fn)
Scrape the url for the following selector, returning an object in the callback fn.
The selector takes an enhanced jQuery-like string that can also select on attributes. The syntax for selecting on attributes is selector@attribute. If you do not supply an attribute, the default is to select the innerText.
Here are a few examples:
- Scrape a single tag
xray('http://google.com', 'title')(function(err, title) {
console.log(title) // Google
})
- Scrape a single class
xray('http://reddit.com', '.content')(fn)
- Scrape an attribute
xray('http://techcrunch.com', 'img.logo@src')(fn)
- Scrape innerHTML
xray('http://news.ycombinator.com', 'body@html')(fn)
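To make the selector@attribute syntax more concrete, here is a minimal sketch of how such a string could be split into its two parts. The helper parseSelector is hypothetical and is not part of x-ray; it assumes the attribute follows the last @ in the string, so it would misread selectors whose attribute values themselves contain @.

```javascript
// Hypothetical helper, not part of x-ray: split "selector@attribute".
// Assumption: the attribute name follows the LAST "@" in the string.
function parseSelector(input) {
  const at = input.lastIndexOf('@')
  if (at === -1) {
    // No attribute given; the caller would fall back to the text content.
    return { selector: input.trim(), attribute: null }
  }
  return {
    selector: input.slice(0, at).trim(),
    attribute: input.slice(at + 1).trim()
  }
}

console.log(parseSelector('img.logo@src')) // { selector: 'img.logo', attribute: 'src' }
console.log(parseSelector('title'))        // { selector: 'title', attribute: null }
```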
xray(url, scope, selector)
You can also supply a scope to each selector. In jQuery, this would look something like this: $(scope).find(selector).
xray(html, scope, selector)
Instead of a url, you can also supply raw HTML and all the same semantics apply.
var html = '<body><h2>Pear</h2></body>'
x(html, 'body', 'h2')(function(err, header) {
header // => Pear
})
API
xray.driver(driver)
Specify a driver to make requests through. Available drivers include:
- request - A simple driver built around request. Use this to set headers, cookies or http methods.
- phantom - A high-level browser automation library. Use this to render pages, when elements need to be interacted with, or when elements are created dynamically using JavaScript (e.g., Ajax calls).
xray.stream()
Returns a Readable Stream of the data. This makes it easy to build APIs around x-ray. Here’s an example with Express:
var app = require('express')()
var x = require('x-ray')()
app.get('/', function(req, res) {
var stream = x('http://google.com', 'title').stream()
stream.pipe(res)
})
xray.write([path])
Stream the results to a path.
If no path is provided, then the behavior is the same as .stream().
xray.then(cb)
Constructs a Promise object and invokes its then function with a callback cb. Be sure to invoke then() as the last step of xray method chaining, since the other methods are not promisified.
x('https://dribbble.com', 'li.group', [
{
title: '.dribbble-img strong',
image: '.dribbble-img [data-src]@data-src'
}
])
.paginate('.next_page@href')
.limit(3)
.then(function(res) {
console.log(res[0]) // prints first result
})
.catch(function(err) {
console.log(err) // handle error in promise
})
xray.paginate(selector)
Select a url from a selector and visit that page.
xray.limit(n)
Limit the amount of pagination to n requests.
xray.abort(validator)
Abort pagination if the validator function returns true.
The validator function receives two arguments:
- result: The scrape result object for the current page.
- nextUrl: The URL of the next page to scrape.
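As an illustration of the result and nextUrl arguments described above, here is a sketch of a validator. The function name, the stopping rules, and the example.com prefix are all assumptions chosen for the example; any function that returns true to stop and false to continue should fit the same shape.

```javascript
// Hypothetical validator for .abort(); the stopping rules are assumptions.
// Returns true (abort) when the current page produced no results, or when
// the next page would leave the site being scraped.
function stopWhenEmptyOrOffsite(result, nextUrl) {
  const isEmpty = Array.isArray(result) && result.length === 0
  const isOffsite =
    typeof nextUrl === 'string' && !nextUrl.startsWith('https://example.com/')
  return isEmpty || isOffsite
}
```

It would be passed as .abort(stopWhenEmptyOrOffsite) in a chain that also calls .paginate().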
xray.delay(from, [to])
Delay the next request between from and to milliseconds.
If only from is specified, delay exactly from milliseconds.
var x = Xray().delay('1s', '10s')
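One plausible reading of "between from and to" is a randomly chosen wait inside that range. The sketch below shows that reading with a hypothetical pickDelay helper; it is not x-ray's implementation, it assumes a uniform random choice, and it takes plain millisecond numbers rather than strings such as '1s'.

```javascript
// Hypothetical helper, not part of x-ray.
// Assumption: the delay is picked uniformly at random in [from, to],
// with both bounds given as numbers of milliseconds.
function pickDelay(from, to) {
  if (to === undefined) {
    return from // only "from" given: wait exactly that long
  }
  return from + Math.floor(Math.random() * (to - from + 1))
}

console.log(pickDelay(1000))        // 1000
console.log(pickDelay(1000, 10000)) // some integer from 1000 to 10000
```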
xray.concurrency(n)
Set the request concurrency to n. Defaults to Infinity.
var x = Xray().concurrency(2)
xray.throttle(n, ms)
Throttle the requests to n requests per ms milliseconds.
var x = Xray().throttle(2, '1s')
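To picture what "n requests per ms milliseconds" means for a queue of requests, the sketch below computes the earliest time each request could start under a simple fixed-window model. The helper earliestStartTimes is hypothetical, and the fixed-window model is an assumption; the library's actual scheduling may differ.

```javascript
// Hypothetical helper, not part of x-ray.
// Assumption: a fixed-window model in which at most n requests start
// in each window of ms milliseconds.
function earliestStartTimes(count, n, ms) {
  const times = []
  for (let i = 0; i < count; i++) {
    times.push(Math.floor(i / n) * ms)
  }
  return times
}

// Five requests throttled to 2 per 1000 ms:
console.log(earliestStartTimes(5, 2, 1000)) // [ 0, 0, 1000, 1000, 2000 ]
```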
xray.timeout(ms)
Specify a timeout of ms milliseconds for each request.
var x = Xray().timeout(30)
Composition
X-ray becomes more powerful when you start composing instances together. Here are a few possibilities:
Crawling to another site
var Xray = require('x-ray')
var x = Xray()
x('http://google.com', {
main: 'title',
image: x('#gbar a@href', 'title') // follow link to google images
})(function(err, obj) {
/*
{
main: 'Google',
image: 'Google Images'
}
*/
})
Scoping a selection
var Xray = require('x-ray')
var x = Xray()
x('http://mat.io', {
title: 'title',
items: x('.item', [
{
title: '.item-content h2',
description: '.item-content section'
}
])
})(function(err, obj) {
/*
{
title: 'mat.io',
items: [
{
title: 'The 100 Best Children\'s Books of All Time',
description: 'Relive your childhood with TIME\'s list...'
}
]
}
*/
})
Filters
Filters can be specified when creating a new Xray instance. To apply filters to a value, append them to the selector using |.
var Xray = require('x-ray')
var x = Xray({
filters: {
trim: function(value) {
return typeof value === 'string' ? value.trim() : value
},
reverse: function(value) {
return typeof value === 'string'
? value
.split('')
.reverse()
.join('')
: value
},
slice: function(value, start, end) {
return typeof value === 'string' ? value.slice(start, end) : value
}
}
})
x('http://mat.io', {
title: 'title | trim | reverse | slice:0,2'
})(function(err, obj) {
/*
{
title: 'oi'
}
*/
})
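The way a filter chain such as trim | reverse | slice:a,b flows from left to right can be sketched without any scraping at all. The applyFilters helper below is hypothetical and is not x-ray's parser; it assumes filters are separated by |, that arguments follow a : and are comma-separated, and that every argument is numeric.

```javascript
// The same three filters as in the example above.
const filters = {
  trim: (value) => (typeof value === 'string' ? value.trim() : value),
  reverse: (value) =>
    typeof value === 'string' ? value.split('').reverse().join('') : value,
  slice: (value, start, end) =>
    typeof value === 'string' ? value.slice(start, end) : value
}

// Hypothetical helper, not part of x-ray: apply "f1 | f2:a,b" left to right.
// Assumption: all filter arguments are numbers.
function applyFilters(value, chain, available) {
  return chain
    .split('|')
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .reduce((current, part) => {
      const [name, rawArgs] = part.split(':')
      const fn = available[name.trim()]
      if (typeof fn !== 'function') {
        throw new Error('Unknown filter: ' + name)
      }
      const args = rawArgs ? rawArgs.split(',').map((a) => Number(a.trim())) : []
      return fn(current, ...args)
    }, value)
}

// '  mat.io  ' -> 'mat.io' -> 'oi.tam' -> 'oi'
console.log(applyFilters('  mat.io  ', 'trim | reverse | slice:0,2', filters)) // oi
```

Each filter receives the output of the previous one, so the order in the chain matters: slicing before reversing would give a different result.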
Examples
- selector: simple string selector
- collections: selects an object
- arrays: selects an array
- collections of collections: selects an array of objects
- array of arrays: selects an array of arrays
In the Wild
- Levered Returns: Uses x-ray to pull together financial data from various unstructured sources around the web.