January 31, 2011

Sphinx over Windows - Problems & Solutions

I ran into some basic problems while running the Sphinx search server on Windows, and judging by the Sphinx forum, a lot of other people hit the same ones. Here they are, along with the solutions:

Problem 1

ERROR: connection to localhost:9312 failed (errno=10060, msg=A connection
attempt failed because the connected party did not properly respond after a period of
time, or established connection failed because connected host has failed to respond

In the hosts file located at \Windows\System32\drivers\etc, uncomment the following line:

127.0.0.1 localhost

For some reason this line is sometimes commented out!

Problem 2

The query doesn’t reach searchd (which is running) from your test.php file

In the test.php file, make the following change:

Use $host = "127.0.0.1"; instead of $host = "localhost";
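
To see quickly whether the host name is the culprit, you can check whether a plain TCP connection succeeds under each name. The following is only a minimal sketch in Python and not part of Sphinx: the function name can_connect is my own, and port 9312 is taken from the error message in Problem 1.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and tries each address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False

if __name__ == "__main__":
    # 9312 is the searchd port shown in the error message above.
    for host in ("localhost", "127.0.0.1"):
        status = "reachable" if can_connect(host, 9312) else "NOT reachable"
        print(f"{host}:9312 is {status}")
```

If 127.0.0.1 is reachable but localhost is not, the hosts file entry from Problem 1 is the likely cause.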

Problem 3

• Service not started
• Error 1067...

Note that when Sphinx is started from the command prompt (rather than as a Windows
service), the searchd.pid file and the log files don't get generated, so when you work
with options like index rotation it says:

WARNING: failed to open pid from pid_file Sphinx\log\searchd.pid.
WARNING: indices NOT rotated.

This error can be avoided by starting Sphinx as a Windows service rather than from the
command prompt:

C:\Sphinx> C:\Sphinx\searchd --install --config C:\Sphinx\sphinx.conf --servicename SphinxSearch

If you get the following error while starting the service from the Windows Task Manager
or with the 'net' command:

The SphinxSearch service could not be started.

A system error has occurred.

System error 1067 has occurred.

For this error, check your Windows Event Logs (found at Control Panel -> System &
Security -> Administrative Tools -> View Event Logs).

They will tell you exactly what the problem is. In most cases the cause of the above
error is that the path to sphinx.conf is incorrect.

This happens because people usually copy and paste this command from the Sphinx manual to install the searchd service:

C:\Sphinx> C:\Sphinx\searchd --install --config C:\Sphinx\sphinx.conf.in --servicename SphinxSearch

Notice the sphinx.conf.in instead of sphinx.conf in the above command.

The Windows command prompt executes this command as soon as it is pasted (because the pasted text includes the return character), and the service is then listed in your Task Manager. However, the service now refers to an invalid conf file and therefore does not start.

Delete the service (searchd --delete --servicename SphinxSearch) and install it again with the correct conf file name.
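
One way to avoid installing the service with a wrong conf path is to check that the file exists before running the install command. This is just a sketch in Python, not anything shipped with Sphinx: the helper name build_install_command is hypothetical, and it only uses the --install, --config and --servicename options shown above.

```python
import os

def build_install_command(searchd, config, service_name="SphinxSearch"):
    """Return the searchd service-install command as an argument list.

    Raises FileNotFoundError if the config file does not exist, which is
    the most common cause of error 1067 according to this post.
    """
    if not os.path.isfile(config):
        raise FileNotFoundError(f"Sphinx config not found: {config}")
    return [searchd, "--install", "--config", config,
            "--servicename", service_name]

# Example use on Windows (paths are the ones used in this post):
#   import subprocess
#   cmd = build_install_command(r"C:\Sphinx\searchd.exe",
#                               r"C:\Sphinx\sphinx.conf")
#   subprocess.run(cmd, check=True)
```

With a path like C:\Sphinx\sphinx.conf.in that does not exist on disk, the helper fails right away instead of leaving a broken service behind.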


January 30, 2011

Sphinx v/s Microsoft Search Server - Part 2

In my last post I compared Sphinx and Microsoft SharePoint Search with respect to indexing time and index size for a fresh (full) indexing run. Sphinx was far ahead of Microsoft SharePoint Search on both counts. Here are the results for incremental indexing:

SharePoint Statistics
Test 1
Total Number of Records: 1 million (already indexed) + 50000 new records
Time Taken for new records to be searchable: 2 min:21 sec

Sphinx Statistics
Test 1
Total Number of Records: 1 million + 50000 new records
Time Taken for records to be searchable: 0.2 seconds

Test 2
Total Number of Records: 10 million + 75000 new records
Time Taken for records to be searchable: 0.8 seconds

Sphinx Incremental Indexing Tests in Detail

2 step process:

1: Incremental Indexing
Sphinx supports "live" (almost real-time) index updates, which can be implemented using the so-called "main+delta" scheme. The idea is to set up two sources and two indexes: one "main" index for the existing data and one "delta" index for the new documents. For example, if we have X million records, we keep them as the main index, and all new documents get added to a new table that acts as the delta. This new table can be indexed from time to time (depending on the application), and the data becomes searchable within seconds.

Tests carried out: Main Index: 10 million records
I created a new table (delta) and added 75000 new documents to it.
Time Taken by Sphinx to index and make the delta searchable: 0.8 seconds.

2: Merging
Depending on our search requirements, we can merge the two indexes (i.e. main + delta) whenever needed and then empty the delta table.
Merging the above 10 million records with the newly added 75000 records took 30 seconds.
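
To make the main+delta idea concrete, here is a toy illustration in Python. It is not how Sphinx is implemented: the in-memory inverted index and the function names (build_index, search, merge) are my own assumptions, chosen only to show that queries look at both indexes and that a merge folds the delta into the main index and empties it.

```python
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(term, main, delta):
    """Query both indexes and combine the matching document ids."""
    term = term.lower()
    return sorted(main.get(term, set()) | delta.get(term, set()))

def merge(main, delta):
    """Fold the delta into the main index and return an empty delta."""
    for term, ids in delta.items():
        main[term] |= ids
    return main, defaultdict(set)

# "main" holds the existing documents, "delta" only the new ones.
main = build_index({1: "sphinx search server", 2: "windows service"})
delta = build_index({3: "sphinx on windows"})

print(search("sphinx", main, delta))   # [1, 3]
main, delta = merge(main, delta)
print(search("windows", main, delta))  # [2, 3]
```

Only the small delta has to be rebuilt when new documents arrive, which is why new records become searchable so quickly; the more expensive merge can be scheduled whenever it suits the application.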

Conclusion:

Sphinx is a great information retrieval system! If you love algorithms, you will definitely enjoy reading the Sphinx code (written in C++), as its data structures and running times are highly optimized. Thanks to Andrew Aksyonoff for creating a wonderful product.


January 25, 2011

Sphinx v/s Microsoft Search Server - Part 1

I have been playing with the Sphinx search engine on Ubuntu Linux for a long time. Recently I had to embed a vertical search over a database that runs on Windows.

I tried Microsoft SharePoint Search Server and Sphinx and did some benchmarking.


System Configuration:

Operating System: Windows Server 64 Bit
Processor: Quad Core AMD Opteron 2356 (2 processors/2.3GHz)
Memory: 4 GB RAM
Cache per processor: L1 (data) = 64 KB, L2 = 512 KB, L3 = 6 MB
Database: MS SQL Server for SharePoint, MySQL for Sphinx

Database Size:

I ran the benchmark on up to [15 million rows of data X 32 columns] for Sphinx. For Microsoft SharePoint Search, however, I only ran it on [1 million rows X 32 columns]. You will soon see why.

Here you go:

Sphinx

Data Size: 1 million rows X 32 columns
Time to Index: 95 seconds
Time To Search: 0.001 to 0.01 sec
Index Size: 0.14GB

Microsoft SharePoint Search Server 2010

Data Size: 1 million rows X 32 columns
Time To Index: 3 hours 46 minutes
Time To Search: 0.001 to 0.01 sec
Index Size: 3.4GB

After seeing the above results (please compare the index sizes too, not just the time to index), you can see why I didn't carry out any further tests with SharePoint Search :)
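
To put the 1 million row results side by side, the arithmetic can be spelled out in a few lines of Python. This is only a back-of-the-envelope sketch using the figures reported above; the variable names are mine.

```python
ROWS = 1_000_000

sphinx_seconds = 95
sharepoint_seconds = 3 * 3600 + 46 * 60  # 3 hours 46 minutes = 13560 s

# How many times longer SharePoint took to index the same data.
speedup = sharepoint_seconds / sphinx_seconds
# How many times larger the SharePoint index is (3.4 GB vs 0.14 GB).
size_ratio = 3.4 / 0.14

print(f"Sphinx: {ROWS / sphinx_seconds:.0f} rows/s")          # 10526 rows/s
print(f"SharePoint: {ROWS / sharepoint_seconds:.0f} rows/s")  # 74 rows/s
print(f"Indexing speedup: {speedup:.0f}x")                    # 143x
print(f"Index size ratio: {size_ratio:.1f}x")                 # 24.3x
```

In other words, on this data set Sphinx indexed roughly 143 times faster and produced an index roughly 24 times smaller.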

I continued with Sphinx and completed the benchmarking for 15 million rows X 32 columns.

Here is the report:

Sphinx Tests

Data Size: 5 million rows X 32 columns
Time To Index: 6 minutes
Time to Search: 0.001 to 0.01 seconds
Sphinx Index Size: 0.75GB

Data Size: 10 million rows X 32 columns
Time To Index: 18 minutes
Time to Search: 0.001 to 0.05 seconds
Sphinx Index Size: 1.45GB

Data Size: 15 million rows X 32 columns
Time To Index: 24 minutes
Time to Search: 0.001 to 0.05 seconds
Sphinx Index Size: 2.25GB

Soon I will post a similar report on incremental indexing/crawling for both Microsoft SharePoint and Sphinx Search Server.