Elasticsearch

Elasticsearch Extension for Yii 2

This extension provides the Elasticsearch integration for the Yii2 framework. It includes basic querying/search support and also implements the ActiveRecord pattern that allows you to store active records in Elasticsearch.

Installation

Requirements

The extension is designed to support Elasticsearch 5.0 and above. It has been tested with the latest versions of Elasticsearch 5.x, 6.x, and 7.x branches.

Configuring Elasticsearch

The extension uses inline scripts for some of its functionality (like the [[yii\elasticsearch\ActiveRecord::updateAllCounters()|updateAllCounters()]] method). The scripts are written in Painless, which Elasticsearch runs in a sandboxed manner. Because Painless is generally enabled by default, no special configuration is required. However, for older versions of Elasticsearch (like 5.0), you may need to enable inline scripts to support this functionality. See Elasticsearch documentation for details.

Getting Composer package

The preferred way to install this extension is through Composer:

composer require yiisoft/yii2-elasticsearch

Configuring application

To use this extension, you need to configure the [[yii\elasticsearch\Connection|Connection]] class in your application configuration:

return [
    //....
    'components' => [
        'elasticsearch' => [
            'class' => 'yii\elasticsearch\Connection',
            'nodes' => [
                ['http_address' => '127.0.0.1:9200'],
                // configure more hosts if you have a cluster
            ],
            // set autodetectCluster to false if you don't want to auto detect nodes
            // 'autodetectCluster' => false,
            'dslVersion' => 7, // default is 5
        ],
    ],
];

The connection needs to be configured with at least one node. The default behavior is cluster autodetection. The extension makes a GET /_nodes request to the first node in the list, and gets the addresses of all the nodes in the cluster. An active node is then randomly selected from the updated node list.

This behavior can be disabled by setting [[yii\elasticsearch\Connection::$autodetectCluster|$autodetectCluster]] to false. In that case an active node will be randomly selected from the nodes given in the configuration.

For cluster autodetection to work properly, the GET /_nodes request to the nodes specified in the configuration must return the http_address field for each node. This is returned by vanilla Elasticsearch instances by default, but has been reported to not be available in environments like AWS. In that case you need to disable cluster detection and specify hosts manually.

It may also be useful to disable cluster autodetection for performance reasons. If a cluster has a single dedicated coordinating-only node, it makes sense to direct all requests to that node. If a cluster contains only a few nodes and their addresses are known, it may be useful to specify them explicitly.
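
As a rough sketch of the last scenario, the configuration fragment below disables autodetection and lists the nodes explicitly. The two addresses are placeholders chosen for illustration, not values taken from this guide.

```php
<?php
// Application config fragment; both node addresses are hypothetical examples.
return [
    'components' => [
        'elasticsearch' => [
            'class' => 'yii\elasticsearch\Connection',
            // Skip the GET /_nodes discovery request and use only the nodes below.
            'autodetectCluster' => false,
            'nodes' => [
                ['http_address' => '10.0.0.11:9200'],
                ['http_address' => '10.0.0.12:9200'],
            ],
            'dslVersion' => 7,
        ],
    ],
];
```

With this configuration an active node is picked at random from the two listed nodes.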

You should set the version of the domain-specific language the extension will use to communicate with the server. The value corresponds to the version of the Elasticsearch server. For 5.x branch set [[yii\elasticsearch\Connection::$dslVersion|$dslVersion]] to 5, for 6.x branch to 6, for 7.x branch to 7. Default is 5.

Mapping & Indexing

Comparison with SQL

Elasticsearch documentation provides an extensive list of concepts in Elasticsearch and SQL and how they map to one another. We’ll focus on the basics.

An Elasticsearch cluster consists of one or more Elasticsearch instances. Requests are sent to one of the instances, which propagates the query to other instances in the cluster, collects results, and then returns them to the client. Therefore a cluster, or an instance that represents it, roughly corresponds to a SQL database.

In Elasticsearch data is stored in indices. An index corresponds to a SQL table.

An index contains documents. Documents correspond to rows in a SQL table. In this extension, an [[yii\elasticsearch\ActiveRecord|ActiveRecord]] represents a document in an index. The operation of saving a document into an index is called indexing.

The schema or structure of a document is defined in the so-called mapping. A mapping defines document fields, which correspond to columns in SQL. In Elasticsearch the primary key field is special, because it always exists and its name and structure cannot be changed. Other fields are fully configurable.

Mapping fields beforehand

Even though new fields will be created on the fly when documents are indexed, it is considered good practice to define a mapping before indexing documents.

Generally, once an attribute is defined, it is not possible to change its type. For example if a text field is configured to use the English language analyzer, it is not possible to switch to a different language without reindexing every document in the index. Certain limited modifications to mapping can be applied on the fly. See Elasticsearch documentation for more info.

Document types

Originally, Elasticsearch was designed to store documents with different structure in the same index. To handle this, a concept of “type” was introduced. However, this approach soon fell out of favor. As a result, types have been removed from Elasticsearch 7.x.

Currently, best practice is to have only one type per index. Technically, if the extension is configured for Elasticsearch 7 or above, [[yii\elasticsearch\ActiveRecord::type()|type()]] is ignored, and implicitly replaced with _doc where required by the API.

Creating helper methods

Our recommendation is to create several static methods in your [[yii\elasticsearch\ActiveRecord|ActiveRecord]] model that deal with index creation and updates. Here is one example of how this can be done.

class Customer extends yii\elasticsearch\ActiveRecord
{
    // Other class attributes and methods go here
    // ...

    /**
     * @return array This model's mapping
     */
    public static function mapping()
    {
        return [
            // Field types: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html#field-datatypes
            'properties' => [
                'first_name'     => ['type' => 'text'],
                'last_name'      => ['type' => 'text'],
                'order_ids'      => ['type' => 'keyword'],
                'email'          => ['type' => 'keyword'],
                'registered_at'  => ['type' => 'date'],
                'updated_at'     => ['type' => 'date'],
                'status'         => ['type' => 'keyword'],
                'is_active'      => ['type' => 'boolean'],
            ]
        ];
    }

    /**
     * Set (update) mappings for this model
     */
    public static function updateMapping()
    {
        $db = static::getDb();
        $command = $db->createCommand();
        $command->setMapping(static::index(), static::type(), static::mapping());
    }

    /**
     * Create this model's index
     */
    public static function createIndex()
    {
        $db = static::getDb();
        $command = $db->createCommand();
        $command->createIndex(static::index(), [
            //'aliases' => [ /* ... */ ],
            'mappings' => static::mapping(),
            //'settings' => [ /* ... */ ],
        ]);
    }

    /**
     * Delete this model's index
     */
    public static function deleteIndex()
    {
        $db = static::getDb();
        $command = $db->createCommand();
        // Deleting an index removes all of its documents; only the index name is needed
        $command->deleteIndex(static::index());
    }
}

To create the index with proper mappings, call Customer::createIndex(). If you have changed the mapping in a way that allows mapping update (e.g. created a new property), call Customer::updateMapping().

However, if you have changed a property (e.g. went from string to date), Elasticsearch will not be able to update the mapping. In this case you need to delete your index (by calling Customer::deleteIndex()), create it anew with updated mapping (by calling Customer::createIndex()), and then repopulate it with data.
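
The rebuild procedure might look like the sketch below. It assumes the Customer class with the helper methods shown earlier; `$rows` is a hypothetical stand-in for whatever source the data is reloaded from (for example, a SQL table).

```php
<?php
// Sketch only: $rows is hypothetical sample data, not part of the extension.
$rows = [
    ['id' => 1, 'first_name' => 'Jane', 'last_name' => 'Doe'],
    ['id' => 2, 'first_name' => 'John', 'last_name' => 'Smith'],
];

$command = Customer::getDb()->createCommand();

// Drop the old index only if it exists, then create it with the updated mapping.
if ($command->indexExists(Customer::index())) {
    Customer::deleteIndex();
}
Customer::createIndex();

// Repopulate the index one document at a time.
foreach ($rows as $row) {
    $customer = new Customer();
    $customer->_id = $row['id'];
    $customer->first_name = $row['first_name'];
    $customer->last_name = $row['last_name'];
    $customer->save();
}
```

For large data sets, saving documents one by one is slow; a bulk request is likely a better fit there.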

Using the Query

The [[yii\elasticsearch\Query]] class is generally compatible with its [[yii\db\Query|parent query class]], well-described in the guide.

The differences are outlined below.

  • As Elasticsearch does not support SQL, the query API does not support join(), groupBy(), having(), and union(). Sorting, limit(), offset(), and where() are all supported (with certain limitations).

  • [[yii\elasticsearch\Query::from()|from()]] does not select the tables, but the index and type to query against.

  • select() has been replaced with [[yii\elasticsearch\Query::storedFields()|storedFields()]]. It defines the fields to retrieve from a document, similar to columns in SQL.

  • As Elasticsearch is not only a database but also a search engine, additional query and aggregation mechanisms are supported. Check out the Query DSL on how to compose queries.
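
To illustrate the points above, here is a minimal sketch of a query built with the supported methods. The index name 'customer' and the field names are assumptions borrowed from the Customer mapping example.

```php
<?php
use yii\elasticsearch\Query;

// 'customer' is an assumed index name; from() selects an index, not a table.
$query = (new Query())
    ->from('customer')
    ->where(['status' => 'active'])             // simple equality condition
    ->orderBy(['registered_at' => SORT_DESC])   // sorting is supported
    ->offset(0)
    ->limit(20);                                // without this, only 10 hits come back

// Each element is a raw hit array; the document fields are under '_source'.
foreach ($query->all() as $hit) {
    echo $hit['_id'], ': ', $hit['_source']['email'] ?? '(no email)', PHP_EOL;
}
```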

Executing queries

The [[yii\elasticsearch\Query]] class provides the usual methods for executing queries: [[yii\elasticsearch\Query::one()|one()]] and [[yii\elasticsearch\Query::all()|all()]]. They return only the search results (or a single result).

There is also the [[yii\elasticsearch\Query::search()|search()]] method that returns both the search results, and all of the metadata retrieved from Elasticsearch, including aggregations.

The extension fully supports the highly efficient scroll mode, which allows retrieving large result sets. See [[yii\elasticsearch\Query::batch()|batch()]] and [[yii\elasticsearch\Query::each()|each()]] for more information.

Number of returned records and pagination caveats

Unlike most SQL servers that will return all results unless a LIMIT clause is provided, Elasticsearch limits the result set to 10 records by default. To get more records, use [[yii\elasticsearch\Query::limit()|limit()]]. This is especially important when defining relations in [[yii\elasticsearch\ActiveRecord|ActiveRecord]], where record limit needs to be specified explicitly.

Elasticsearch is generally poorly suited to tasks that require deep pagination. It is optimized for search engine behavior, where only the first few pages of results have any relevance. While it is technically possible to go far into the result set using [[yii\elasticsearch\Query::limit()|limit()]] and [[yii\elasticsearch\Query::offset()|offset()]], performance degrades as the offset grows.

One possible solution is to use the scroll mode, which behaves similarly to cursors in traditional SQL databases. Scroll mode is implemented with the [[yii\elasticsearch\Query::batch()|batch()]] and [[yii\elasticsearch\Query::each()|each()]] methods.
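
A minimal sketch of scroll mode follows. The index name is an assumption, and the statement that limit() controls the size of each scroll page reflects my reading of the extension rather than anything stated in this guide.

```php
<?php
use yii\elasticsearch\Query;

$query = (new Query())
    ->from('customer')                 // assumed index name
    ->where(['is_active' => true]);

// each() walks the whole result set through the scroll API, one hit at a time.
// '1m' is how long Elasticsearch keeps the scroll context alive between requests.
foreach ($query->each('1m') as $hit) {
    echo $hit['_id'], PHP_EOL;
}

// batch() yields one scroll page of hits per iteration instead.
// Assumption: the page size is taken from limit().
foreach ($query->limit(500)->batch('1m') as $hits) {
    echo count($hits), ' hits in this batch', PHP_EOL;
}
```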

Error handling in queries

Elasticsearch is a distributed database. Because of its distributed nature, certain requests may be partially successful.

Consider how a typical search is performed. The query is sent to all relevant shards, then their results are collected, processed, and returned to user. It is possible that not all shards are able to return a result. Yet, even with some data missing, the result may be useful.

With every query the server returns some additional metadata, including data on which shards failed. This data is lost when using standard Yii2 methods like [[yii\elasticsearch\Query::one()|one()]] and [[yii\elasticsearch\Query::all()|all()]]. Even if some shards failed, it is not considered a server error.

To get extended data, including shard statistics, use the [[yii\elasticsearch\Query::search()|search()]] method.

The query itself can also fail for a number of reasons (connectivity issues, syntax error, etc.) but that will result in an exception.
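
The check for partial results can be sketched as a small helper. `failedShardCount()` is a hypothetical function, not part of the extension; it expects the decoded response array in the shape Elasticsearch returns from a search request, which is assumed to be what search() hands back. The sample response is hand-written for illustration.

```php
<?php
/**
 * Hypothetical helper: number of shards that failed for a search response.
 * Returns 0 when the shard statistics are missing.
 */
function failedShardCount(array $response): int
{
    return (int) ($response['_shards']['failed'] ?? 0);
}

// Hand-written sample in the shape of an Elasticsearch search response.
$response = [
    '_shards' => ['total' => 5, 'successful' => 4, 'skipped' => 0, 'failed' => 1],
    'hits' => ['hits' => []],
];

if (failedShardCount($response) > 0) {
    echo 'Partial result: some shards failed', PHP_EOL;
}
```

In application code the sample array would be replaced by the return value of `$query->search()`.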

Error handling in bulk requests

In Elasticsearch a bulk request performs multiple operations in a single API call. This reduces overhead and can greatly increase indexing speed.

The operations are executed individually, so some can be successful, while others fail. Having some of the operations fail does not cause the whole bulk request to fail. If it is important to know if any of the constituent operations failed, the [[yii\elasticsearch\BulkCommand::execute()|result of the bulk request]] needs to be checked.

The bulk request itself can also fail, for example, because of connectivity issues, but that will result in an exception.
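
Checking the result could look like the sketch below. `bulkFailures()` is a hypothetical helper, not part of the extension. It relies on the standard Elasticsearch bulk response format (a top-level "errors" flag and one entry in "items" per operation) and assumes that this decoded array is what the bulk command's execute() returns. The sample response is hand-written.

```php
<?php
/**
 * Hypothetical helper: collects the failed operations of a bulk response.
 * Returns a map of document ID => error reason.
 */
function bulkFailures(array $response): array
{
    $failures = [];
    if (empty($response['errors'])) {
        return $failures; // "errors" is false when every operation succeeded
    }
    foreach ($response['items'] as $item) {
        // Each item has a single key naming the action: index, create, update or delete.
        $result = reset($item);
        if (isset($result['error'])) {
            $failures[$result['_id']] = $result['error']['reason'] ?? 'unknown error';
        }
    }
    return $failures;
}

// Hand-written sample: the first operation succeeded, the second one failed.
$response = [
    'errors' => true,
    'items' => [
        ['index' => ['_id' => '1', 'status' => 201]],
        ['index' => ['_id' => '2', 'status' => 400, 'error' => [
            'type' => 'mapper_parsing_exception',
            'reason' => 'failed to parse field [registered_at]',
        ]]],
    ],
];

print_r(bulkFailures($response));
```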

Document counts in ES >= 7.0.0

As of Elasticsearch 7.0.0, for result sets over 10 000 hits, document counts (total_hits) are no longer exact by default. In other words, if the result set contains more than 10 000 documents, total_hits is reported as 10 000, and if it is less, then it is reported exactly. This results in a performance improvement.

The track_total_hits option can be used to change this behavior. If it is set to 'true', exact document count will always be returned, and an integer value overrides the default threshold value of 10 000.

$query = new Query();
$query->from('customer');

// Note the literal string 'true', not a boolean value!
$query->addOptions(['track_total_hits' => 'true']);
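
When reading the count back, note that Elasticsearch 7 reports the total as an object with a value and a relation ("eq" for an exact count, "gte" for a lower bound). The helper below is a hypothetical sketch that interprets this structure; it assumes the decoded response from search() keeps the shape returned by the server.

```php
<?php
/**
 * Hypothetical helper: turns the hits.total object of an
 * Elasticsearch 7 search response into a readable string.
 */
function describeTotalHits(array $response): string
{
    $total = $response['hits']['total'];
    // "gte" means the real count is at least this value.
    $prefix = $total['relation'] === 'gte' ? 'at least ' : '';
    return $prefix . $total['value'];
}

// Hand-written samples in the shape of an Elasticsearch 7 response.
$capped = ['hits' => ['total' => ['value' => 10000, 'relation' => 'gte']]];
$exact  = ['hits' => ['total' => ['value' => 123, 'relation' => 'eq']]];

echo describeTotalHits($capped), PHP_EOL;
echo describeTotalHits($exact), PHP_EOL;
```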

Runtime Fields/Mappings in ES >= 7.11

Runtime Fields are fields that can be dynamically generated at query time by supplying a script, similar to script_fields. The major difference is that the value of a Runtime Field can be used in search queries, aggregations, filtering, and sorting.

Any Runtime Field values that you want to be included in the search results must be added to the fields array by passing an array of field names to the fields() method.

The following example fetches users’ full names by concatenating the first_name and last_name fields from the index and sorts them alphabetically.

$results = (new yii\elasticsearch\Query())
    ->from('users')
    ->runtimeMappings([
        'full_name' => [
            'type' => 'keyword',
            'script' => "emit(doc['first_name'].value + ' ' + doc['last_name'].value)",
        ],
    ])
    ->fields(['full_name'])
    ->orderBy(['full_name' => SORT_ASC])
    ->search($connection);

For more information concerning type and script, please see Elastic’s Runtime Field Documentation.

Using the ActiveRecord

Elasticsearch ActiveRecord is very similar to the database ActiveRecord as described in the guide.

Most of its limitations and differences are derived from the [[yii\elasticsearch\Query]] implementation.

For defining an Elasticsearch ActiveRecord class your record class needs to extend from [[yii\elasticsearch\ActiveRecord]] and implement at least the [[yii\elasticsearch\ActiveRecord::attributes()|attributes()]] method to define the attributes of the record.

NOTE: It is important NOT to include the primary key attribute (_id) in the attributes.

class Customer extends yii\elasticsearch\ActiveRecord
{
    // Other class attributes and methods go here
    // ...
    public function attributes()
    {
        return ['first_name', 'last_name', 'order_ids', 'email', 'registered_at', 'updated_at', 'status', 'is_active'];
    }
}

You may override [[yii\elasticsearch\ActiveRecord::index()|index()]] and [[yii\elasticsearch\ActiveRecord::type()|type()]] to define the index and type this record represents.

NOTE: Type is ignored for Elasticsearch 7.x and above. See Mapping & Indexing for more information.
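
A minimal sketch of such overrides is shown below. The index name 'customers' is a made-up example, and the attribute list is shortened for brevity.

```php
<?php
class Customer extends yii\elasticsearch\ActiveRecord
{
    /**
     * @return string the name of the index this record is stored in
     */
    public static function index()
    {
        return 'customers'; // hypothetical index name
    }

    /**
     * @return string the document type; ignored when dslVersion is 7 or above
     */
    public static function type()
    {
        return '_doc';
    }

    public function attributes()
    {
        return ['first_name', 'last_name', 'email'];
    }
}
```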

Usage examples

// Creating a new record
$customer = new Customer();
$customer->_id = 1; // setting primary keys is only allowed for new records
$customer->last_name = 'Doe'; // attributes can be set one by one
$customer->attributes = ['first_name' => 'Jane', 'email' => 'janedoe@example.com']; // or together
$customer->save();

// Getting records using the primary key
$customer = Customer::get(1); // get a record by pk
$customer = Customer::findOne(1); // also works
$customers = Customer::mget([1,2,3]); // get multiple records by pk
$customers = Customer::findAll([1, 2, 3]); // also works

// Finding records using simple conditions
$customer = Customer::find()->where(['first_name' => 'John', 'last_name' => 'Smith'])->one();

// Finding records using query DSL
// (see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html)
$articles = Article::find()->query(['match' => ['title' => 'yii']])->all();

$articles = Article::find()->query([
    'bool' => [
        'must' => [
            ['term' => ['is_active' => true]],
            ['terms' => ['email' => ['johnsmith@example.com', 'janedoe@example.com']]]
        ]
    ]
])->all();

Primary keys

Unlike traditional SQL databases that let you define a primary key as any column or a set of columns, or even create a table without a primary key, Elasticsearch stores the primary key separately from the rest of the document. The key is not part of the document structure and cannot be changed once the document is saved into the index.

While Elasticsearch can create unique primary keys for new documents, it is also possible to specify them explicitly for new records. Note that the key attribute is a string and is limited to 512 bytes. See Elasticsearch docs for more information.

In Elasticsearch, the name of the primary key is _id, and [[yii\elasticsearch\ActiveRecord]] provides getter and setter methods to access it as a property. There is no need to add it to [[yii\elasticsearch\ActiveRecord::attributes()|attributes()]].
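
Both ways of obtaining a key can be sketched as follows, assuming the Customer class from the examples above. The explicit key 'customer-42' is an arbitrary illustration.

```php
<?php
// Let Elasticsearch generate the key: leave _id unset before saving.
$customer = new Customer();
$customer->first_name = 'Jane';
$customer->save();
echo $customer->_id, PHP_EOL;            // key generated by Elasticsearch

// Or set the key explicitly; this is only allowed for new records.
$other = new Customer();
$other->_id = 'customer-42';             // hypothetical key, stored as a string
$other->first_name = 'John';
$other->save();
echo $other->getPrimaryKey(), PHP_EOL;   // same value as $other->_id
```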

Foreign keys

SQL databases often use autoincremented integer columns as primary keys. When models from such databases are used in relations in Elasticsearch models, those integers effectively become foreign keys.

Even though these keys are technically numeric, generally they should not be mapped as a numeric field datatype. Elasticsearch optimizes numeric fields, such as integer or long, for range queries. However, keyword fields are better for term and other term-level queries. Therefore it is recommended to use keyword field type for foreign keys. See Elasticsearch docs for more information on keyword fields.
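
For instance, a hypothetical Order model that references customers could map its foreign key as shown in this sketch. The class, its attributes, and the mapping() helper follow the conventions used for Customer earlier and are assumptions for illustration.

```php
<?php
class Order extends yii\elasticsearch\ActiveRecord
{
    public function attributes()
    {
        return ['customer_id', 'created_at', 'total'];
    }

    /**
     * @return array This model's mapping
     */
    public static function mapping()
    {
        return [
            'properties' => [
                // Foreign key: looked up by exact value, so keyword rather than integer
                'customer_id' => ['type' => 'keyword'],
                'created_at'  => ['type' => 'date'],
                // A genuine number that may be used in range queries
                'total'       => ['type' => 'double'],
            ],
        ];
    }
}
```

An exact-match lookup such as `Order::find()->where(['customer_id' => '42'])->all()` then runs as a term-level query against the keyword field.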

Defining relations

It is possible to define relations from Elasticsearch ActiveRecords to other Elasticsearch and non-Elasticsearch ActiveRecord classes and vice versa. However, [[yii\elasticsearch\ActiveQuery::via()|via]]-relations cannot be defined using a table, as there are no tables in Elasticsearch. You can only define such relations using other relations.

class Customer extends yii\elasticsearch\ActiveRecord
{
    // Every customer has multiple orders, every order has exactly one invoice

    public function getOrders()
    {
        // This relation gets up to 100 most recent orders of current customer
        return $this->hasMany(Order::className(), ['customer_id' => '_id'])
                    ->orderBy(['created_at' => SORT_DESC])
                    ->limit(100); // override the default limit of 10
    }

    public function getInvoices()
    {
        // This via-relation works by fetching the related "orders"
        // models first. This query also needs a limit, but it makes
        // no sense to make that limit different from the underlying
        // relation.
        return $this->hasMany(Invoice::className(), ['_id' => 'order_id'])
                    ->via('orders')->limit(100);
    }
}

NOTE: Elasticsearch limits the number of records returned by any query to 10 records by default. This applies to queries executed when getting related models. If you expect to get more records you should specify the limit explicitly in relation definition. It is also important for [[yii\elasticsearch\ActiveQuery::via()|via]]-relations to set the proper limit both in the relation itself as well as the underlying relation that is used as an intermediary.

Scalar and array attributes

Any field in an Elasticsearch document can hold multiple values. For example, if a customer mapping includes a keyword field for order ID, it is automatically possible to create a document with one, two, or more order IDs. One can say that every field in a document is an array.

For consistency with [[yii\db\ActiveRecord]], when populating the record from data, single-item arrays are replaced with the value they contain. However, it is possible to override this behavior by defining [[yii\elasticsearch\ActiveRecord::arrayAttributes()|arrayAttributes()]].

public function arrayAttributes()
{
    return ['order_ids'];
}

This way, once fetched from the database, $customer->order_ids will be an array even if it contains only one item, e.g. ['AB-32162'].

Organizing complex queries

Any query can be composed using Elasticsearch’s query DSL and passed to the [[yii\elasticsearch\Query::query()|query()]] method. However, ES query DSL is notorious for its verbosity, and these oversized queries soon become unmanageable.

The usual approach with SQL ActiveRecord classes is to create scopes using methods in the query class that modify the query itself. This does not work so well with Elasticsearch, so the recommended approach is to create static methods that return building blocks of the query, then combine them.

class CustomerQuery extends yii\elasticsearch\ActiveQuery
{
    public static function name($name)
    {
        return ['match' => ['name' => $name]];
    }

    public static function address($address)
    {
        return ['match' => ['address' => $address]];
    }

    public static function registrationDateRange($dateFrom, $dateTo)
    {
        return ['range' => ['registered_at' => [
            'gte' => $dateFrom,
            'lte' => $dateTo,
        ]]];
    }
}

Now these sub-queries can be used to build the query.

$customers = Customer::find()->query([
    'bool' => [
        'must' => [
            CustomerQuery::registrationDateRange('2016-01-01', '2016-01-20')
        ],
        'should' => [
            CustomerQuery::name('John'),
            CustomerQuery::address('London'),
        ],
        'must_not' => [
            CustomerQuery::name('Jack'),
        ],
    ],
])->all();

Aggregations

The aggregations framework helps provide aggregated data based on a search query. It is based on simple building blocks called aggregations, that can be composed in order to build complex summaries of the data.

As an example, let’s determine how many customers have been registered each month.

$searchResult = Customer::find()->addAggregate('customers_by_date', [
    'date_histogram' => [
        'field' => 'registered_at',
        'calendar_interval' => 'month',
    ],
])->limit(0)->search();

$customersByDate = ArrayHelper::map($searchResult['aggregations']['customers_by_date']['buckets'], 'key_as_string', 'doc_count');

Note that in this example [[yii\elasticsearch\ActiveQuery::search()|search()]] is used in place of [[yii\elasticsearch\ActiveQuery::one()|one()]] or [[yii\elasticsearch\ActiveQuery::all()|all()]]. The search() method returns not only the models, but also query metadata: shard statistics, aggregations, etc. When using aggregations, the search results (hits) themselves often don’t matter. That is why we’re using [[yii\elasticsearch\ActiveQuery::limit()|limit(0)]] to only return the metadata.

After some processing, $customersByDate contains data similar to this:

[
    '2020-01-01' => 5,
    '2020-02-01' => 3,
    '2020-03-01' => 17,
]

Suggesters

Sometimes it is necessary to suggest search terms that are similar to the search query and exist in the index. For example, it might be useful to find known alternative spellings of a name. See the example below, and also Elasticsearch docs for details.

$searchResult = Customer::find()
    ->limit(0)
    ->addSuggester('customer_name', [
        'text' => 'Hans',
        'term' => [
            'field' => 'name',
        ],
    ])
    ->search();

// Note that limit(0) will prevent the query from returning hits,
// so only suggestions are returned

$suggestions = ArrayHelper::map($searchResult["suggest"]["customer_name"], 'text', 'options');
$names = ArrayHelper::getColumn($suggestions['Hans'], 'text');
// $names == ['Hanns', 'Hannes', 'Hanse', 'Hansi']

Unusual behavior of attributes with object mapping

The extension updates records using the _update endpoint. Since this endpoint is designed to perform partial updates to documents, all attributes that have an “object” mapping type in Elasticsearch will be merged with existing data. To demonstrate:

$customer = new Customer();
$customer->my_attribute = ['foo' => 'v1', 'bar' => 'v2'];
$customer->save();
// at this point the value of my_attribute in Elasticsearch is {"foo": "v1", "bar": "v2"}

$customer->my_attribute = ['foo' => 'v3', 'bar' => 'v4'];
$customer->save();
// now the value of my_attribute in Elasticsearch is {"foo": "v3", "bar": "v4"}

$customer->my_attribute = ['baz' => 'v5'];
$customer->save();
// now the value of my_attribute in Elasticsearch is {"foo": "v3", "bar": "v4", "baz": "v5"}
// but $customer->my_attribute is still equal to ['baz' => 'v5']

Since this logic only applies to objects, the solution is to wrap the object into a single-element array. Since to Elasticsearch a single-element array is the same thing as the element itself, there is no need to modify any other code.

$customer->my_attribute = [['new' => 'value']]; // note the double brackets
$customer->save();
// now the value of my_attribute in Elasticsearch is {"new": "value"}
$customer->my_attribute = $customer->my_attribute[0]; // could be done for consistency

For more information see this discussion: discuss.elastic.co/t/updating-an-o...

Working with data providers

The extension comes with its own enhanced and optimized [[\yii\elasticsearch\ActiveDataProvider|ActiveDataProvider]] class. The enhancements include:

  • Total record count is obtained from the same query that gets the records themselves, not in a separate query.
  • Aggregation data is available as a property of the data provider.

While [[\yii\elasticsearch\Query]] and [[\yii\elasticsearch\ActiveQuery]] can be used with [[\yii\data\ActiveDataProvider]], this is not recommended.

NOTE: The data provider fetches the result models and the total count in a single Elasticsearch query, so the total count only becomes available after the pagination limit has been applied. This makes it impossible to verify beforehand whether the requested page number actually exists. For this reason, the data provider automatically disables [[yii\data\Pagination::$validatePage]].

Usage examples

use yii\elasticsearch\ActiveDataProvider;
use yii\elasticsearch\Query;

// Using Query
$query = new Query();
$query->from('customer');

// ActiveQuery can also be used
// $query = Customer::find();

// The first argument is the name under which the aggregation is returned
$query->addAggregate('customers_by_date', ['date_histogram' => [
    'field' => 'registered_at',
    'calendar_interval' => 'month',
]]);

$query->addSuggester('customer_name', [
    'text' => 'Hans',
    'term' => [
        'field' => 'customer_name',
    ]
]);

$dataProvider = new ActiveDataProvider([
    'query' => $query,
    'pagination' => [
        'pageSize' => 10,
    ]
]);

$models = $dataProvider->getModels();
$aggregations = $dataProvider->getAggregations();
$suggestion = $dataProvider->getSuggestions();

Using the Elasticsearch DebugPanel

The Yii 2 Elasticsearch extension provides a DebugPanel that can be integrated with the Yii debug module and shows the executed Elasticsearch queries. It also lets you run these queries and view the results.

Add the following to your application config to enable it (if you already have the debug module enabled, it is sufficient to just add the panels configuration):

    // ...
    'bootstrap' => ['debug'],
    'modules' => [
        'debug' => [
            'class' => 'yii\\debug\\Module',
            'panels' => [
                'elasticsearch' => [
                    'class' => 'yii\\elasticsearch\\DebugPanel',
                ],
            ],
        ],
    ],
    // ...

Author email: zhuzixian520@126.com, GitHub: github.com/zhuzixian520

This article was first published on LearnKu.com.
