Building Site Search for an SSH2 System with Lucene + Paoding

September 22, 2008 | Tags: lucene paoding ssh2
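Goal: build a highly portable site search that rebuilds its index on a schedule.
Approach: keep both the dictionary (dic) and the index inside the application.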

Preparation:
1 Lucene
Lucene Java (hereafter Lucene) is currently available as version 2.4.0. For details, see http://lucene.apache.org/java/docs/index.html


2 Paoding
An excellent Chinese word segmentation component for Lucene, written by Qieqie. The current version is paoding-analysis-2.0.4-beta, which targets Lucene 2.2. For details, see http://code.google.com/p/paoding/


3 Download the latest paoding-analysis-2.0.4-beta release (it bundles lucene-core-2.2.0.jar, lucene-analyzers-2.2.0.jar, lucene-highlighter-2.2.0.jar, junit.jar, and commons-logging.jar).

>>>>> This is an original article. If you repost it, please credit: http://www.doaction.cn/blog/post/java02.html Thanks for your support! <<<<<


Getting started:
1 Trial run
Open the examples folder in the downloaded package and run the examples (watch the file encoding).


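2 Integrating into the SSH2 system (system structure: Action -> service -> dao)
1) Because an SSH2 system is a web application, configuring Paoding may differ somewhat from the first step.
Copy everything under the src folder of the paoding distribution, together with the dic folder, into your project. Open paoding-dic-home.properties and set paoding.dic.home.config-fisrt=this so that the program uses this configuration file, then set paoding.dic.home=classpath:dic so that the dictionary lives inside the project. Save the file. I use classpath:dic here for portability. An absolute path needs no further work, but if you specify classpath:dic you have to modify the Paoding source. Find the setDicHomeProperties method in PaodingMaker.java and replace File dicHomeFile = getFile(dicHome); with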
    // Resolve the dictionary home as before, then URL-decode its path
    File dicHomeFile2 = getFile(dicHome);
    String path = "";
    try {
        path = URLDecoder.decode(dicHomeFile2.getPath(), "UTF-8");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    File dicHomeFile = new File(path);

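The purpose is to decode the path. Otherwise, if the dictionary path contains spaces or Chinese characters, a dictionary-not-found exception is thrown.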
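To illustrate why the decoding step matters, here is a minimal standalone sketch. The class name DicPathDecodeSketch, the method decodePath, and the sample path are hypothetical and not part of Paoding. It only shows what URLDecoder does to a percent-encoded path of the kind a classpath lookup can return.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class DicPathDecodeSketch {
    // Decodes a percent-encoded path (hypothetical helper for illustration).
    static String decodePath(String encodedPath) {
        try {
            return URLDecoder.decode(encodedPath, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample path: %20 is an encoded space, and the next six
        // escapes are the UTF-8 bytes of two Chinese characters.
        String encoded = "/C:/Program%20Files/app/classes/%E8%AF%8D%E5%85%B8/dic";
        System.out.println(decodePath(encoded));
    }
}
```

Note that URLDecoder also turns a literal "+" into a space, so a directory name containing "+" would need different handling.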

2) Table structure
    CREATE TABLE `news` (
      `id` int(11) NOT NULL auto_increment,
      `title` varchar(255) default NULL,
      `details` mediumtext,
      `author` varchar(255) default NULL,
      `publisher` varchar(100) default NULL,
      `clicks` int(11) default NULL,
      `source` varchar(255) default NULL,
      `addtime` datetime default NULL,
      `category` varchar(100) default NULL,
      `keywords` varchar(255) default NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=gbk;

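3 Writing the code
Site search takes two steps: creating the index and running searches. The classes needed are SearchAction.java and TaskAction.java (in the same directory).
1) Creating the index
Main task: read from an existing txt file the id of the last news item indexed in the previous run, fetch from the business layer all news items with a larger id and index them, and finally write the id of the last item indexed in this run back to the txt file. Paths need careful handling here. All the txt files that record ids are placed under the action directory.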
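On the path handling: one way to make it less platform-dependent is sketched below, assuming the resource URL is a percent-encoded file: URL, which is what class loaders normally return for resources on disk. java.io.File can be built directly from a URI, which decodes the path without any manual string slicing. ResourcePathSketch and toFile are hypothetical names, not part of the article's classes.

```java
import java.io.File;
import java.net.URI;

class ResourcePathSketch {
    // Converts a file: URL string (for example the result of
    // clazz.getResource(name).toString()) into a File.
    // Throws IllegalArgumentException if the string is not a valid file: URI.
    static File toFile(String resourceUrl) {
        return new File(URI.create(resourceUrl));
    }

    public static void main(String[] args) {
        // Assumed sample URL with an encoded space in a directory name.
        File f = toFile("file:/tmp/my%20app/newsid.txt");
        System.out.println(f.getPath());
    }
}
```

This only works for resources that really are files on disk. A resource packed inside a jar has a jar: URL and cannot be opened as a File, which is also true of the approach used in this article.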
Create TaskAction and add the following methods
    public void createIndex() {
        try {
            // Two arguments: the index location, and the file holding the
            // id of the last news item indexed in the previous run
            createNewsIndex(getPath(TaskAction.class, "date/index/news"), "newsid.txt");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

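    public String getPath(Class clazz, String textName)
            throws IOException {
        // substring(6) strips the leading "file:/" from the resource URL,
        // which leaves a usable path on Windows (e.g. C:/...)
        String path = (URLDecoder.decode(
                clazz.getResource(textName).toString(), "UTF-8")).substring(6);
        return path;
    }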

    public void createNewsIndex(String path, String textName) throws Exception {
        // Id of the last news item indexed in the previous run ("0" if none)
        String newsId = readText(TaskAction.class, textName);
        if (null == newsId || "".equals(newsId))
            newsId = "0";

        // Use the Paoding Chinese analyzer
        Analyzer analyzer = new PaodingAnalyzer();
        FSDirectory directory = FSDirectory.getDirectory(path);
        System.out.println(directory.toString());
        // The third argument is true (create a new index) only on the first run
        IndexWriter writer = new IndexWriter(directory, analyzer, isEmpty(TaskAction.class, textName));

        // Fetch from the business layer all news items with an id greater than
        // newsId (the loop assumes they come back in ascending id order)
        List list = newsManageService.getNewsBigId(Integer.parseInt(newsId));
        Iterator iterator = list.iterator();
        while (iterator.hasNext()) {
            Document doc = new Document();
            News news = (News) iterator.next();
            doc.add(new Field("id", "" + news.getId(), Field.Store.YES,
                    Field.Index.UN_TOKENIZED));
            doc.add(new Field("title", "" + news.getTitle(), Field.Store.YES,
                    Field.Index.TOKENIZED));
            doc.add(new Field("author", "" + news.getAuthor(), Field.Store.YES,
                    Field.Index.TOKENIZED));
            doc.add(new Field("details", ""
                    + Constants.splitAndFilterString(news.getDetails()),
                    Field.Store.YES, Field.Index.TOKENIZED,
                    Field.TermVector.WITH_POSITIONS_OFFSETS));
            doc.add(new Field("addtime", "" + news.getAddtime(),
                    Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("keywords", "" + news.getKeywords(),
                    Field.Store.YES, Field.Index.TOKENIZED));
            System.out.println("Indexing file " + news.getTitle() + "...");
            // Remember the id of the most recently indexed item
            newsId = String.valueOf(news.getId());
            try {
                writer.addDocument(doc);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        // Optimize and close
        writer.optimize();
        writer.close();

        // Write the id of the last indexed item back to the file
        WriteText(TaskAction.class, textName, newsId);
    }

    // Returns true when no id has been recorded yet, i.e. the index
    // has to be created from scratch
    public boolean isEmpty(Class clazz, String textName) throws Exception {
        String articleId = readText(clazz, textName);
        boolean isEmpty = (null == articleId || "".equals(articleId)
                || "0".equals(articleId));
        System.out.println(clazz.getName() + " " + isEmpty);
        return isEmpty;
    }

    // Adapted from a method in the Paoding examples.
    public String readText(Class clazz, String textName)
            throws IOException {
        InputStream in = clazz.getResourceAsStream(textName);
        Reader re = new InputStreamReader(in, "UTF-8");
        try {
            char[] chs = new char[1024];
            int count;
            StringBuilder content = new StringBuilder();
            while ((count = re.read(chs)) != -1) {
                content.append(chs, 0, count);
            }
            return content.toString();
        } finally {
            // Always release the underlying stream
            re.close();
        }
    }

    public String WriteText(Class clazz, String textName, String text)
            throws IOException {
        // Same resource-to-path conversion as getPath
        String path = getPath(clazz, textName);
        System.out.println(path);
        File file = new File(path);
        BufferedWriter bw = new BufferedWriter(new FileWriter(file));
        try {
            // Overwrite the file with the new id
            bw.write(text);
        } finally {
            bw.close();
        }
        return text;
    }
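The incremental scheme used above (read the last indexed id, process only newer items, write the new last id back) can be seen in isolation in the following sketch. It is a simplified stand-in that works on plain integer ids instead of News objects and Lucene documents. All names in it (IndexCheckpointSketch, selectAndAdvance, tempCheckpoint) are hypothetical, and it uses Java 8 file APIs rather than the classpath-resource approach of the article.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IndexCheckpointSketch {
    // Reads the last indexed id; a missing or empty file counts as 0.
    static int readLastId(Path checkpoint) {
        try {
            if (!Files.exists(checkpoint)) return 0;
            String text = new String(Files.readAllBytes(checkpoint),
                    StandardCharsets.UTF_8).trim();
            return text.isEmpty() ? 0 : Integer.parseInt(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Overwrites the checkpoint file with the given id.
    static void writeLastId(Path checkpoint, int id) {
        try {
            Files.write(checkpoint, String.valueOf(id).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the ids newer than the checkpoint (the ones a real run would
    // index) and advances the checkpoint to the largest of them.
    static List<Integer> selectAndAdvance(Path checkpoint, List<Integer> allIds) {
        int lastId = readLastId(checkpoint);
        int maxId = lastId;
        List<Integer> toIndex = new ArrayList<Integer>();
        for (int id : allIds) {
            if (id > lastId) {
                toIndex.add(id);
                maxId = Math.max(maxId, id);
            }
        }
        if (maxId != lastId) writeLastId(checkpoint, maxId);
        return toIndex;
    }

    // Creates an empty temporary checkpoint file for the demo.
    static Path tempCheckpoint() {
        try {
            return Files.createTempFile("newsid", ".txt");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path checkpoint = tempCheckpoint();
        // First run: nothing recorded yet, so every id is selected.
        System.out.println(selectAndAdvance(checkpoint, Arrays.asList(1, 2, 3)));
        // Second run: only ids added since the first run are selected.
        System.out.println(selectAndAdvance(checkpoint, Arrays.asList(1, 2, 3, 4, 5)));
    }
}
```

Tracking the maximum id rather than the id of the last item seen makes the checkpoint independent of the order in which items are returned.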

2) Searching
    public void searchIndex(String path, String keywords) throws Exception {
        String[] FIELD = { "title", "details" };

        Analyzer analyzer = new PaodingAnalyzer();
        FSDirectory directory = FSDirectory.getDirectory(path);
        IndexReader reader = IndexReader.open(directory);
        // A match in either field is enough (SHOULD on both)
        BooleanClause.Occur[] flags = new BooleanClause.Occur[] {
                BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD };
        Query query = MultiFieldQueryParser.parse(keywords, FIELD, flags,
                analyzer);

        Searcher searcher = new IndexSearcher(directory);
        // Rewrite into primitive queries so the highlighter can score the terms
        query = query.rewrite(reader);
        System.out.println("Searching for: " + query.toString());
        Hits hits = searcher.search(query);

        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            String contents1 = doc.get("details");

            // The pre- and post-tag arguments are empty here; pass the markup
            // that should surround each matched term
            SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter(
                    "", "");
            Highlighter highlighter = new Highlighter(simpleHTMLFormatter,
                    new QueryScorer(query));
            highlighter.setTextFragmenter(new SimpleFragmenter(200));

            // Reset per hit so that a hit without details does not reuse
            // the fragment of the previous hit
            String highLightText = "";
            if (contents1 != null) {
                TokenStream tokenStream = analyzer.tokenStream("details",
                        new StringReader(contents1));
                highLightText = highlighter.getBestFragment(tokenStream,
                        contents1);
            }
            NewsDTO news = new NewsDTO();
            news.setId(Integer.parseInt(doc.get("id")));
            news.setName(doc.get("title"));
            news.setDetails(highLightText);
            news.setAddtime(doc.get("addtime"));
            news.setAuthor(doc.get("author"));
            searchResultItem.add(news);
        }
        searcher.close();
        reader.close();
    }

That completes the core code. It also includes result highlighting, which works quite nicely.
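As a rough picture of what the highlighting step produces, the sketch below wraps every literal occurrence of a term in a pair of tags. This is a deliberately simplified stand-in, not the Lucene Highlighter, which works on analyzed tokens and picks the best-scoring fragment. HighlightSketch, highlight, and the <b> tags are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HighlightSketch {
    // Wraps each literal occurrence of term in preTag/postTag.
    static String highlight(String text, String term, String preTag, String postTag) {
        if (term == null || term.isEmpty()) {
            return text;
        }
        // Pattern.quote treats the term as plain text, not as a regex.
        Matcher matcher = Pattern.compile(Pattern.quote(term)).matcher(text);
        // $0 is the matched text; the tags are quoted so that any '$' or '\'
        // in them is inserted literally.
        return matcher.replaceAll(Matcher.quoteReplacement(preTag) + "$0"
                + Matcher.quoteReplacement(postTag));
    }

    public static void main(String[] args) {
        System.out.println(highlight("Lucene and Paoding", "Paoding", "<b>", "</b>"));
    }
}
```

In the search method, the equivalent of preTag and postTag are the two constructor arguments of SimpleHTMLFormatter.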

3) Scheduled index creation:
Define the following beans
    <bean id="myTask" class="edu.cumt.jnotnull.action.TaskAction">
        <property name="newsManageService">
            <ref bean="newsManageService" />
        </property>
    </bean>

    <bean id="entity"
        class="org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean">
        <property name="targetObject">
            <ref local="myTask" />
        </property>
        <property name="targetMethod">
            <value>createIndex</value>
        </property>
    </bean>

    <bean id="cron"
        class="org.springframework.scheduling.quartz.CronTriggerBean">
        <property name="jobDetail">
            <ref bean="entity" />
        </property>
        <property name="cronExpression">
            <value>0 0-5 2 * * ?</value>
        </property>
    </bean>

    <bean autowire="no"
        class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
        <property name="triggers">
            <list>
                <ref local="cron" />
            </list>
        </property>
    </bean>

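With this configuration, index creation is triggered automatically every night. The cron expression above fires once a minute from 02:00 through 02:05.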
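A Quartz cron expression lists seconds, minutes, hours, day of month, month, and day of week, in that order. To make the effect of the minute range concrete, here is a small sketch that expands only the first three fields into daily fire times. It is a hypothetical helper (CronRangeSketch), not the Quartz parser, and it understands only plain numbers and a-b ranges.

```java
import java.util.ArrayList;
import java.util.List;

class CronRangeSketch {
    // Expands one numeric cron field that is either "n" or "a-b".
    static List<Integer> expandField(String field) {
        List<Integer> values = new ArrayList<Integer>();
        int dash = field.indexOf('-');
        if (dash < 0) {
            values.add(Integer.parseInt(field));
            return values;
        }
        int from = Integer.parseInt(field.substring(0, dash));
        int to = Integer.parseInt(field.substring(dash + 1));
        for (int v = from; v <= to; v++) {
            values.add(v);
        }
        return values;
    }

    // Lists the HH:mm:ss fire times within one day, using only the
    // seconds, minutes, and hours fields of the expression.
    static List<String> dailyFireTimes(String cron) {
        String[] fields = cron.trim().split("\\s+");
        List<String> times = new ArrayList<String>();
        for (int hour : expandField(fields[2])) {
            for (int minute : expandField(fields[1])) {
                for (int second : expandField(fields[0])) {
                    times.add(String.format("%02d:%02d:%02d", hour, minute, second));
                }
            }
        }
        return times;
    }

    public static void main(String[] args) {
        System.out.println(dailyFireTimes("0 0-5 2 * * ?"));
        System.out.println(dailyFireTimes("0 0 2 * * ?"));
    }
}
```

Because the indexer only processes news with an id above the recorded one, the repeated runs within those minutes find little or nothing new. If a single run per night is enough, an expression such as 0 0 2 * * ? would be the usual choice.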

