Chinese Indexing in Solr

Some of our SearchStax clients index websites that use multiple languages. We were recently asked how to enable Solr indexing of Mandarin on a cloud platform. (This post describes indexing Traditional Chinese characters. It is also possible to use Simplified Chinese by following a similar series of steps. Contact us at support@searchstax.com for an example.)

 
Solr does not parse Chinese text by default, but it ships with the appropriate tokenizers. The default configuration of the ICU Tokenizer is suitable for Traditional Chinese text. It follows the Word Break rules from the Unicode Text Segmentation algorithm for non-Chinese text, and uses a dictionary to segment Chinese words. To use this tokenizer, you must add extra JAR files to Solr’s classpath (as described below).

Step 1: Obtain Configuration Files

To add Traditional Chinese indexing to your Solr project, you need to modify your project configuration files. If you need to download the files from an existing project, see How can I view my Zookeeper Configurations?

Step 2: Add the Required Library

Update the solrconfig.xml file by adding the following lines after the existing lib declarations.

<!-- Traditional Chinese library -->
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex="lucene-analyzers-icu-\d.*\.jar" />
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex="icu4j-\d.*\.jar" />
<!-- Traditional Chinese library - END -->

This library comes with Solr, so you don’t have to alter your deployment in any way to make it work.
 

Step 3: Update the Schema

A. Create a new field type in the managed-schema file that uses the ICU Tokenizer.

<fieldType name="text_mandarin" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

B. Create a field that uses this field type.

<field name="text_man" type="text_mandarin" multiValued="true" indexed="true" stored="true"/>
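
If your Chinese text already arrives in another field, one way to populate text_man is a copyField rule in the same managed-schema file. The fragment below is only a sketch and is not part of the original setup: the source field name content is a hypothetical placeholder, so substitute a field that exists in your own schema.

```xml
<!-- Assumption: "content" is a hypothetical source field; replace it with a field from your schema. -->
<!-- Copies the raw value of "content" into text_man, where the text_mandarin analyzer tokenizes it. -->
<copyField source="content" dest="text_man"/>
```

Because text_man is declared multiValued, it can receive copies from more than one source field if you add further copyField rules.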

Step 4: Upload Configuration and Reload Collection

Upload the altered configuration to your SearchStax cloud server and reload your collection. See How do I update the Solr Schema? for step-by-step instructions.

About the Author

By Karan Jeet Singh

Solutions Engineer

July 3, 2019
