Project Requirements 1

Automatically download at least 500 English documents/web pages and 50 Chinese documents/web pages using a download engine (web crawler/spider). Keep a backup of each original document/web page (e.g., News_1_Org.txt). Write a program that automatically preprocesses the downloaded documents:

Tokenize the text into words and delete special characters.
Investigate and select appropriate Chinese word segmentation techniques and tools, and use them to implement Chinese word segmentation. See “Lucene tokenizer comparison”.
Delete English stop words.
Delete Chinese stop words.
Call a library or write your own code to implement Porter stemming.
Tokenize the Chinese documents so that they can be indexed by the search engine.
For English documents, save the simplified document produced by the above processing (for example, News_1_E.txt) for later indexing.
For Chinese documents, save the simplified document produced by the above processing (for example, News_1_C.txt) for later indexing.
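
The English preprocessing steps above can be sketched as follows. This is a minimal illustration, not a reference solution: the function name preprocess_english, the tiny STOP_WORDS set, and the stem parameter are all assumptions made for the example. The sketch also treats digits as special characters and removes them, which is a design choice the requirements do not spell out.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real run would load the
# provided stop vocabulary from a file instead.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def preprocess_english(text, stop_words=STOP_WORDS, stem=lambda word: word):
    """Lowercase, delete special characters, drop stop words, then stem.

    `stem` defaults to the identity function so the sketch runs without
    third-party packages; pass a real stemmer to get Porter stemming.
    """
    # Replace every character that is not a lowercase ASCII letter or
    # whitespace with a space (this also removes digits and punctuation).
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = cleaned.split()
    return [stem(token) for token in tokens if token not in stop_words]
```

If NLTK is installed, PorterStemmer().stem from nltk.stem can be passed as the stem argument. The simplified document (for example, News_1_E.txt) could then be written as the tokens joined by spaces.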

Notes:

Integrate all steps as much as possible into a single, fully automated system. If that is not possible, implement each step as a separate program (module) and pass the results between steps using intermediate files. Extra points are awarded if you write the download engine yourself. Try to download and process more pages for later search. Use the stop word list provided, or a stop word list you generate yourself to suit the application.
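
For a self-written download engine, a breadth-first crawler is one possible starting point. The sketch below uses only the Python standard library. The names LinkParser, fetch, crawl, and save_originals are hypothetical, the pages are assumed to be UTF-8, and politeness measures such as robots.txt checks and rate limiting are left out for brevity.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page as text (assumes UTF-8; real pages may differ)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed, limit, fetch_page=fetch):
    """Breadth-first crawl from seed; returns {url: html} for up to limit pages."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def save_originals(pages, prefix="News"):
    """Write each page to <prefix>_<n>_Org.txt, numbering from 1."""
    for n, html in enumerate(pages.values(), start=1):
        with open(f"{prefix}_{n}_Org.txt", "w", encoding="utf-8") as out:
            out.write(html)
```

Passing fetch_page as a parameter keeps the crawl logic testable without network access, and it leaves room to swap in a fetcher that detects the page encoding, which matters for the Chinese pages.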

Requirements 2

Establish and implement a text search function

Use/call the open source search engine Lucene or Lemur to implement a text search engine. Consult the relevant documentation to install the software.
Implement the search function over the 500 preprocessed English and Chinese documents/web pages.
Index the documents with the above software, then enter keywords through a front-end interface or the interface provided by the software to demonstrate the search function.
The front end can be a web form, an application form, or an existing interface tool.
English search must be implemented. Chinese search is optional.

Comparing similarities between documents

Calculate the similarity between any two documents using the cosine distance. List the original text of the two documents and give the similarity value.
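
As an illustration, cosine similarity over bag-of-words vectors can be computed as below. This is a sketch under assumptions: it uses raw term counts from the preprocessed token lists (TF-IDF weights are a common alternative), and the function name cosine_similarity is chosen for the example. Cosine distance, if needed, is 1 minus this value.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

For example, the token lists ["a", "b"] and ["a", "c"] share one of two terms each, which gives a similarity of 0.5.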

Clustering the downloaded documents using the K-Means clustering algorithm

Cluster the 500 downloaded Chinese/English documents into 20 categories, and display the three largest classes formed after clustering along with the representative documents in each class (i.e., the five documents closest to the class center).
The distance can be computed using either cosine distance or Euclidean distance.
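
A minimal K-Means sketch using Euclidean distance is shown below. It assumes each document has already been converted to a numeric vector of equal length (for example, term frequencies over a shared vocabulary), which the requirements leave open. The names kmeans and summarize, the iteration cap, and the random seed are illustrative choices. For the project, k would be 20.

```python
import math
import random


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def kmeans(vectors, k, iterations=50, seed=0):
    """Plain K-Means; returns (labels, centers)."""
    rng = random.Random(seed)
    # Initialize the centers with k distinct positions from the data.
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: euclidean(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def summarize(vectors, labels, centers, top_clusters=3, top_docs=5):
    """Largest clusters as (label, size, indices of docs nearest the center)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    largest = sorted(clusters.items(), key=lambda item: len(item[1]),
                     reverse=True)[:top_clusters]
    return [
        (label, len(members),
         sorted(members, key=lambda i: euclidean(vectors[i], centers[label]))[:top_docs])
        for label, members in largest
    ]
```

The default arguments of summarize match the requirement to show the three largest classes and the five documents nearest each center. To use cosine distance instead, one option is to normalize every vector to unit length before clustering, since Euclidean distance between unit vectors increases with cosine distance.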

Chenyang Shi
