Python implementation of some algorithms for Correlation Clustering. Specifically:
- Linear-programming + region-growing O(log n)-approximation algorithms for general weighted graphs
  - `round_demaine` in `src/pyccalg.py`: Demaine et al.'s rounding algorithm
  - `round_charikar` in `src/pyccalg.py`: Charikar et al.'s rounding algorithm
- `kwikcluster` in `src/pyccalg.py`: KwikCluster randomized, linear-time algorithm (Ailon et al., JACM 2008), achieving constant-factor approximation guarantees on complete graphs satisfying certain constraints (e.g., probability constraint and/or triangle-inequality constraint)
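The pivoting idea behind KwikCluster can be sketched in a few lines. This is a minimal illustration, not the code in `src/pyccalg.py`: the function name `kwikcluster_sketch`, its parameters, and the assumption that only positive ("similar") edges are passed in are all hypothetical.

```python
import random

def kwikcluster_sketch(nodes, positive_edges, seed=None):
    """Pivot-based clustering: take a random unclustered pivot and group it
    with its still-unclustered positive neighbours, until no node is left."""
    rng = random.Random(seed)
    # Adjacency sets over positive ("similar") edges only.
    neighbours = {u: set() for u in nodes}
    for u, v in positive_edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    # Scanning a uniformly random permutation and skipping clustered nodes
    # is equivalent to drawing each pivot uniformly from the remaining nodes.
    order = list(nodes)
    rng.shuffle(order)
    clustered = set()
    clusters = []
    for pivot in order:
        if pivot in clustered:
            continue
        cluster = {pivot} | (neighbours[pivot] - clustered)
        clustered |= cluster
        clusters.append(cluster)
    return clusters

# Two disjoint positive triangles: every pivot choice recovers both of them.
triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
result = kwikcluster_sketch(range(6), triangles, seed=0)
```

Each node and each positive edge is touched a constant number of times, which is where the linear running time comes from.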
- Python v3.6+
- For linear-programming-based algorithms:
  - `SciPy` v1.6+ and/or `PuLP`
  - `SciPy linprog` comes with various solvers: 'Method `highs-ds` is a wrapper of the C++ high performance dual revised simplex implementation (HSOL). Method `highs-ipm` is a wrapper of a C++ implementation of an interior-point method; it features a crossover routine, so it is as accurate as a simplex solver. Method `highs` chooses between the two automatically. For new code involving linprog, we recommend explicitly choosing one of these three method values instead of `interior-point` (default), `revised simplex`, and `simplex` (legacy)'. See the SciPy `linprog` documentation for more details.
  - `PuLP` comes with two solvers by default: `CBC` (linear and integer programming) and `CHOCO` (constraint programming), but it can connect to many others (e.g., `GUROBI`, `CPLEX`, `SCIP`, `MIPCL`, `XPRESS`, `GLPK`) if you have them installed
  - Here we use `highs-ipm` with `SciPy linprog` and the default `CBC` with `PuLP`
  - However, any linear-programming `Python` library (other than `SciPy linprog` or `PuLP`) can alternatively be used with minimal adaptation
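To make the solver choice concrete, here is a hedged sketch of how the standard LP relaxation of correlation clustering (one distance variable per vertex pair, bounded in [0, 1], subject to triangle inequalities) could be assembled and handed to `scipy.optimize.linprog` with `method="highs-ipm"`. The helper names `build_cc_lp` and `solve_with_scipy` are hypothetical, and the actual formulation in `src/pyccalg.py` may differ.

```python
from itertools import combinations

def build_cc_lp(n, w_pos, w_neg):
    """Return (pairs, c, A_ub, b_ub) for the LP relaxation on n vertices.

    w_pos / w_neg map a pair (u, v) with u < v to its positive / negative
    weight; missing pairs are treated as weight 0.
    """
    pairs = list(combinations(range(n), 2))
    index = {p: i for i, p in enumerate(pairs)}
    # Objective: sum of w_pos * x + w_neg * (1 - x) over all pairs.
    # The constant sum(w_neg) does not affect the minimiser and is dropped.
    c = [w_pos.get(p, 0.0) - w_neg.get(p, 0.0) for p in pairs]
    A_ub, b_ub = [], []
    for u, v, w in combinations(range(n), 3):
        e = [index[(u, v)], index[(v, w)], index[(u, w)]]
        # One triangle inequality per side: x_side - x_other1 - x_other2 <= 0.
        for k in range(3):
            row = [0.0] * len(pairs)
            for j in range(3):
                row[e[j]] = 1.0 if j == k else -1.0
            A_ub.append(row)
            b_ub.append(0.0)
    return pairs, c, A_ub, b_ub

def solve_with_scipy(c, A_ub, b_ub):
    # Imported lazily so that build_cc_lp works without SciPy installed.
    from scipy.optimize import linprog
    return linprog(c, A_ub=A_ub or None, b_ub=b_ub or None,
                   bounds=(0, 1), method="highs-ipm")

# Tiny hypothetical instance: 0-1 and 1-2 are similar, 0-2 is dissimilar.
pairs, c, A_ub, b_ub = build_cc_lp(3, {(0, 1): 1.0, (1, 2): 1.0}, {(0, 2): 2.0})
```

Calling `solve_with_scipy(c, A_ub, b_ub)` returns SciPy's result object, whose `x` attribute holds the fractional distances that a rounding step would then consume. The number of triangle constraints grows cubically with the number of vertices, so this dense construction is only meant for small inputs.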
python src/pyccalg.py -d <DATASET_FILE> [-r <LB,UB>] [-a <PROB>] [-s {'pulp','scipy'}] [-m {'charikar','demaine','kwik'}]
- Optional arguments:
  - `-r <LB,UB>`, if you want to generate random edge weights from the `[LB,UB]` range
  - `-a <PROB>`, if you want to randomly add edges with probability `PROB`
  - `-m {'charikar','demaine','kwik'}`, to choose the algorithm (default: `'charikar'`). NOTE: `kwikcluster` is always run too
  - `-s {'pulp','scipy'}`, to select the solver to be used (default: `'scipy'`, which seems faster)
- Dataset-file format:
  - First line: `#VERTICES \t #EDGES`
  - One line per edge; every line is a quadruple: `NODE1 \t NODE2 \t POSITIVE_WEIGHT \t NEGATIVE_WEIGHT` (`POSITIVE_WEIGHT` and `NEGATIVE_WEIGHT` are ignored if the code is run with the `-r` option)
  - Look at the `data` folder for some examples
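As an illustration of the dataset-file format, the sketch below parses a tab-separated file body into a vertex count and a list of weighted edges. The function name `parse_dataset` and the sample content are made up for this example (the sample is not taken from the `data` folder), and node identifiers are kept as strings because the format description does not fix their type.

```python
def parse_dataset(text):
    """Parse the tab-separated dataset format described above."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header line: number of vertices and number of edges.
    n_vertices, n_edges = (int(tok) for tok in lines[0].split("\t"))
    edges = []
    for ln in lines[1:]:
        # Each edge line is a quadruple: NODE1, NODE2, positive weight, negative weight.
        u, v, w_pos, w_neg = ln.split("\t")
        edges.append((u, v, float(w_pos), float(w_neg)))
    if len(edges) != n_edges:
        raise ValueError(f"header declares {n_edges} edges, found {len(edges)}")
    return n_vertices, edges

# Hypothetical file content with 3 vertices and 2 edges.
sample = "3\t2\n0\t1\t0.8\t0.2\n1\t2\t0.1\t0.9\n"
n_vertices, edges = parse_dataset(sample)
```

Checking the edge count against the header catches truncated files early, before any solver is invoked.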